A Curated Image Parameter Dataset from Solar Dynamics Observatory Mission


Authors: Azim Ahmadzadeh, Dustin J. Kempton, Rafal A. Angryk

Draft version June 5, 2019
Typeset using LaTeX twocolumn style in AASTeX62

Azim Ahmadzadeh,1 Dustin J. Kempton,1 and Rafal A. Angryk1
1 Georgia State University, Atlanta, GA 30302, USA

ABSTRACT

We provide a large image parameter dataset extracted from the Solar Dynamics Observatory (SDO) mission's AIA instrument, for the period of January 2011 through the current date, with the cadence of six minutes, for nine wavelength channels. The volume of the dataset for each year is just short of 1 TiB. Towards achieving better results in the region classification of active regions and coronal holes, we improve upon the performance of a set of ten image parameters through an in-depth evaluation of various assumptions that are necessary for the calculation of these image parameters. Then, where possible, a method for finding appropriate settings for the parameter calculations was devised, as well as a validation task to show our improved results. In addition, we include comparisons of JP2 and FITS image formats using supervised classification models, by tuning the parameters specific to the format of the images from which they are extracted, and specific to each wavelength. The results of these comparisons show that utilizing JP2 images, which are significantly smaller files, is not detrimental to the region classification task that these parameters were originally intended for. Finally, we compute the tuned parameters on the AIA images and provide a public API (see: http://dmlab.cs.gsu.edu/dmlabapi/) to access the dataset. This dataset can be used in a range of studies on AIA images, such as content-based image retrieval or tracking of solar events, where dimensionality reduction on the images is necessary for feasibility of the tasks.

Keywords: methods: data analysis — Sun: flares — Sun: general — techniques: image processing

1.
INTRODUCTION

Corresponding author: Azim Ahmadzadeh (aahmadzadeh1@cs.gsu.edu, dkempton1@cs.gsu.edu)

Near real-time monitoring and recording of the Sun's activities has opened new doors for solar physicists to better understand the physics of different solar events. This was made possible in February 2010, when the Solar Dynamics Observatory (SDO) (Pesnell et al. 2012) was launched as the first mission of NASA's Living With a Star (LWS) Program, which is a long-term project dedicated to the study of the Sun and its impact on human life (Withbroe 2013). The SDO mission is invaluable for the monitoring of space weather and the prediction of solar events which produce high-energy particles and radiation. Such activities can have significant impacts on space and air travel, power grids, GPS, and communications satellites (Council 2008). SDO started capturing and transmitting to Earth approximately 70,000 high-resolution images of the Sun per day, or about 0.55 petabytes of data per year (Martens et al. 2012). This volume of data will only increase in time and with future missions. It is simply infeasible to take full advantage of such a large collection of data by traditional, human-based analysis of the images. Making it possible for solar physicists to extract information and knowledge from such a large volume of data brings new challenges to other domains such as database management, computer vision, machine learning, and many others.

One of the primary objectives for improving the usability of such a large dataset is to reduce the size of the L1.5 FITS data without a significant loss of the information contained within the data. This can be done by utilizing either data compression algorithms or feature extraction (i.e., summarization) techniques, or both.
While the features can be extracted from the highest quality of available data (in our study, for instance, from AIA images in FITS format that we will discuss thoroughly later), the images may only be needed in smaller sizes or in compressed formats such as JPEG2000 (JP2) or JPG. Of course, different approaches must be tailored for the different tasks for which the data is being prepared, but an appropriate data reduction is extremely beneficial regardless.

Table 1. The ten image parameters computed on the AIA images used to produce the dataset.

    #   Image Parameter              Formula
    1   Entropy                      -\sum_{i=0}^{L} p(i) \cdot \log_2(p(i))
    2   Mean (\mu)                   \sum_{i=0}^{L} h(i) \cdot i
    3   Standard Deviation (\sigma)  \sqrt{\sum_{i=0}^{L} h(i) \cdot (i - \mu)^2}
    4   Fractal Dimension            -\lim_{\varepsilon \to 0} \log(N) / \log(\varepsilon)
    5   Skewness (\mu_3)             \frac{1}{\sigma^3} \sum_{i=0}^{L} h(i)(i - \mu)^3
    6   Kurtosis (\mu_4)             \frac{1}{\sigma^4} \sum_{i=0}^{L} h(i)(i - \mu)^4
    7   Uniformity                   \sum_{i=0}^{L} p^2(i)
    8   Relative Smoothness          1 - \frac{1}{1 + \sigma^2}
    9   Tamura Contrast              \frac{\sigma^2}{\mu_4^{0.25}}
    10  Tamura Directionality        See Eq. 3

L: maximum intensity value (e.g., 255); i: color intensity value (i \in [0, L]); p: probability (i.e., normalized histogram); h: histogram; N: number of counting boxes; \varepsilon: side length of the counting box.

By significantly reducing the size of the dataset, many useful tasks are made possible that previously may have been too costly to compute, if at all. To name a few, this would pave the road for more efficient search and retrieval of images, clustering of similar regions of images across a wider temporal window, classification of solar events based on their regional texture, tracking of different events in time, and even real-time prediction of solar phenomena, for which the total computation time must comply with the streaming rate of the SDO images. Such reduction in size not only allows faster operations but also keeps the focus on some key aspects of the data, called features.
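To make the formulas in Table 1 concrete, the histogram-based parameters can be computed as sketched below. This is our own illustrative Python sketch, not the authors' implementation; we assume an 8-bit cell and, for simplicity, treat both h(i) and p(i) as the normalized histogram. Fractal dimension and Tamura directionality are omitted here, as they are discussed separately.

```python
import numpy as np

def histogram_parameters(image, levels=256):
    """Compute the histogram-based parameters of Table 1 on one grid cell.

    `image` is a 2-D array of integer intensities in [0, levels - 1].
    Both h(i) and p(i) are taken here as the normalized histogram.
    """
    counts, _ = np.histogram(image, bins=levels, range=(0, levels))
    p = counts / counts.sum()                      # normalized histogram
    i = np.arange(levels)

    mean = np.sum(p * i)                                  # parameter 2
    sigma = np.sqrt(np.sum(p * (i - mean) ** 2))          # parameter 3
    nz = p > 0                                            # avoid log2(0)
    entropy = -np.sum(p[nz] * np.log2(p[nz]))             # parameter 1
    skewness = np.sum(p * (i - mean) ** 3) / sigma ** 3   # parameter 5
    kurtosis = np.sum(p * (i - mean) ** 4) / sigma ** 4   # parameter 6
    uniformity = np.sum(p ** 2)                           # parameter 7
    rel_smoothness = 1.0 - 1.0 / (1.0 + sigma ** 2)       # parameter 8
    t_contrast = sigma ** 2 / kurtosis ** 0.25            # parameter 9

    return {"entropy": entropy, "mean": mean, "std": sigma,
            "skewness": skewness, "kurtosis": kurtosis,
            "uniformity": uniformity, "rel_smoothness": rel_smoothness,
            "tamura_contrast": t_contrast}
```

Note that a perfectly constant cell (\sigma = 0) leaves skewness, kurtosis, and Tamura contrast undefined, and would have to be special-cased in practice.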
Reducing the raw data into some important features is crucial owing to the fact that image repositories inherit the 'curse of dimensionality', as every pixel is represented in one dimension. These high-dimensional spaces are problematic as they may yield misleading results in any analysis that requires statistical significance, and this expands to affect almost all machine learning techniques (Trunk 1979; Hinneburg et al. 2000; Verleysen & François 2005). The curse is attributed to the situation where the growth in dimensionality of the data space is so fast that the number of available data samples cannot properly fill up the high-dimensional space, which renders machine learning models powerless. Another important outcome of reducing the data volume is that, by providing a more manageable data repository that can be easily accessed and managed by anyone without needing large and expensive storage devices or being highly skilled in dealing with 'big data', more researchers from different domains may be encouraged to run different experiments on this collection of data and possibly provide more insight about the data.

To be able to more efficiently and accurately extract a set of important features from SDO's image data, various means of data mining should be utilized. This study builds upon a stack of techniques to derive the important image parameters, for the entire collection that is continuously being updated, starting from 2011. Preprocessing of the original (L1.5) AIA image data, integrating the data with the spatiotemporal information such as the detected bounding boxes of different solar events' instances and the time stamps of their occurrences, extracting the important characteristics of the images, and labeling the instances are some of the major steps we take to transform the original data into data that can be fed into machine learning models.
We utilize supervised learning to tune the features to reach their highest performance in classifying two important solar event types, namely active regions and coronal holes. In addition, we provide a comparative analysis between the features extracted from different image formats, in terms of their quality in distinguishing different solar events. In addition to providing the dataset as our primary goal, we hope that our detailed discussion on these topics will be informative for scientists interested in SDO images, or in the extraction of image parameters in general.

Releasing the final dataset in the form of a public API will make image-based analysis of solar events easier and may open new doors not only to solar physicists but also to computer scientists who are interested in feeding their models with a dataset different from the existing, general-purpose image repositories.

The remainder of this paper is organized in the following way: A background overview of SDO data and the image parameters that we are interested in is presented in Section 2. In Section 3, we explain the different sources we retrieve the data from and discuss the image types we run our models on. We then, in Section 4, analyze each of the image parameters and their variables which require tuning. The tuning process, and its evaluation using supervised learning, is presented in Section 5. After finding the best setting for each of the image parameters, we provide a thorough analysis of the produced data in Section 6.

Figure 1. Grid-based segmentation of an AIA image with a grid of 64 x 64 cells, each of side length 64 pixels. As an example, the mean image parameter is calculated on each cell and the resultant 64 x 64-pixel heat-map of the output is shown on the bottom-right corner. The heat-map is enlarged for better visibility.
Section 7 concludes this work and discusses future work. And finally, in Appendix A, we present some statistical analysis of the created dataset to paint a more accurate picture of the reliability and usability of the data.

2. BACKGROUND

The Solar Dynamics Observatory (SDO) was launched on February 11, 2010, as the first mission of NASA's Living With a Star (LWS) Program, with a five-year prime mission lifetime. The main goal of this project is to better understand the physics of solar variations that influence life and society. Now that it has been close to a decade since its launch, the observatory has provided us with approximately 4 petabytes of data in total and is currently continuing to record even more. The Atmospheric Imaging Assembly (AIA), as one of the three SDO instruments, focuses on the evolution of the magnetic environment in the Sun's atmosphere and its interaction with embedded and surrounding plasma (Lemen et al. 2012).

The AIA images archived in the Joint SDO Operations Center (JSOC)[1] science-data processing (SDP) facility have been processed by the SDO Feature Finding Team (FFT)[2] (Martens et al. 2012) using its 16 post-processing modules. The modules are designed for the detection of solar event classes such as flares, active regions, filaments, and CMEs in near real-time, and others such as coronal holes, sunspots, and jets. The results have been posted at least twice a day to the Heliophysics Event Knowledgebase (HEK) system (Hurlburt et al. 2010) since March 2010. One of the FFT's modules, which targets AR and CH events, is called the SPoCA suite (Verbeeck et al. 2014). SPoCA, or Spatial Possibilistic Clustering Algorithm, is run in near real-time at Lockheed Martin Solar and Astrophysics Laboratory and reports to the AR and CH catalogs of the HEK. It works on a variety of data sources including SDO's AIA images.
SPoCA segments EUV images into three classes, namely AR, CH, and QS (quiet Sun). That is, it eventually attributes each pixel to one of the three classes, after running different fuzzy clustering algorithms on the images and applying some pre- and post-processing filters.

Due to the size of the dataset produced by the SDO, an efficient search and retrieval system over the entire archive is a necessity. In 2010, this issue was first explored by Banda et al., and the ambitious task of creating a Content-Based Image Retrieval (CBIR) system on the SDO AIA images was started (Banda & Angryk 2010a). Given the volume and velocity of the data stream, the ten best image parameters (listed in Table 1) were chosen based on their effectiveness in classification of the solar events and also their processing time (Banda & Angryk 2010b). The concern regarding the running time of the implemented parameters is rooted in the ultimate goal of near real-time processing of the data and the prediction of solar events. The processing window is therefore bounded by the rate of eight 4096 x 4096-pixel images being transmitted to Earth every 10 seconds. The performance of these parameters was further experimented with and confirmed by Banda et al. (2011, 2013). Due to the variety of issues that must be addressed for a reliable CBIR system to be created, this is still an active research area, with the latest update in Schuh et al. (2017).

[1] JSOC; joint between Stanford and the Lockheed Martin Solar and Astrophysics Laboratory (LMSAL)
[2] An international consortium of groups selected by NASA to produce a comprehensive set of automated feature recognition modules.

Figure 2. Heatmap plots of the ten image parameters extracted from an AIA JP2 image captured on 2017-09-06 at 12:55:00, from the 171 Å channel. Panels: (a) Entropy, (b) Mean, (c) Std. Deviation, (d) Fractal Dim., (e) Skewness, (f) Kurtosis, (g) Uniformity, (h) Rel. Smoothness, (i) T. Contrast, (j) T. Directionality.

In addition to the analysis performed in the previously mentioned works, these parameters have also been used for the classification of filaments in H-alpha images from the Big Bear Solar Observatory (BBSO), and similar success was reported by Schuh & Angryk (2014). Schuh et al. also employed these ten image parameters for the development of a trainable module for use in the CBIR system (Schuh et al. 2015), along with a thorough analysis of three years of SDO data (from Jan 1, 2012 through Dec 31, 2014). Yet another sequence of studies benefits from the same set of image parameters for the tracking of solar phenomena in time (Kempton & Angryk 2015; Kempton et al. 2016a, 2018). In that work, their tracking model utilizes sparse coding to classify solar event detections as either the same detected event at a later time or an entirely different solar event of the same type. This model links the individually reported object detections into sets of object detection reports called tracks, using a multiple hypothesis tracking algorithm. This was accomplished through the consideration of the same set of image parameters on which we concentrate in this study. We hope that our thorough analysis, which results in a significant improvement in the effectiveness of the ten image parameters, noticeably helps the performance of all of the above studies.

2.1. Image Parameters

All parameters in Table 1, except for fractal dimension and Tamura directionality, capture some information about the distribution of the pixel intensity values of the images, and none of them preserve the spatial information of the pixels. Even though the spatial information is not preserved, the distribution-related data provide many clues as to the characteristics of the image. For example, a narrowly distributed histogram indicates a low-contrast image.
A bimodal distribution often suggests that the image contains an object or a region with a narrow amplitude range against a background of differing amplitude. However, the location and shape of the solar phenomena, similar to the temporal information, are crucial aspects of our data. In order to help preserve some of the spatial information of the data, we apply a grid-based segmentation on the images. This is a widely used technique, already experimented with on the AIA images by Banda & Angryk (2009, 2010b), that has shown good results. Each 4096 x 4096-pixel AIA image is segmented by a fixed 64 x 64-cell grid. For each grid cell, which spans a square of 64 x 64 pixels of the image, the 10 image parameters are calculated. In Fig. 1, such a segmentation, as well as the heat-map of the mean parameter (\mu) as an example, is visualized. Since we are processing 10 parameters for each image (see Fig. 2), the image then forms a data cube of size 64 x 64 x 10. Additionally, for each time step, we process 9 images from different wavelength filter channels of the AIA instrument.

The image parameters can also be categorized into two main groups: those which describe purely statistical characteristics of an image and those that capture the textural information. The former further divides into two subcategories: 1) parameters such as mean, standard deviation, skewness, kurtosis, relative smoothness, and Tamura contrast, that solely depend on the pixel intensity values of the image; 2) parameters such as uniformity and entropy that, in addition to the pixel values, depend on the choice of the bin size required for the construction of the normalized histogram of the color intensities[3]. The latter group captures the characteristics of the image texture within the regions of interest (i.e., solar events).
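The grid-based segmentation described above can be sketched as follows. This is our own illustration, not the authors' code; `np.mean` and `np.std` merely stand in for the full list of ten parameter functions.

```python
import numpy as np

def parameter_cube(image, n_cells=64, params=None):
    """Segment a square image into an n_cells x n_cells grid and compute
    each parameter function on every cell, yielding an
    n_cells x n_cells x k data cube."""
    if params is None:
        params = [np.mean, np.std]        # stand-ins for the ten parameters
    side = image.shape[0] // n_cells      # e.g., 4096 // 64 = 64 pixels
    cube = np.empty((n_cells, n_cells, len(params)))
    for r in range(n_cells):
        for c in range(n_cells):
            cell = image[r * side:(r + 1) * side, c * side:(c + 1) * side]
            for k, f in enumerate(params):
                cube[r, c, k] = f(cell)
    return cube
```

Called with the ten parameter functions on a 4096 x 4096 image, this produces the 64 x 64 x 10 data cube described above, and nine such cubes per time step (one per AIA channel).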
In the following text, we elaborate more on the four image parameters which require deeper attention.

2.1.1. Entropy

Entropy, as an image parameter, has been widely utilized in a variety of interdisciplinary studies, ranging from medical images (Pluim et al. 2003) to astronomical (Starck et al. 2001) and satellite (Barbieri et al. 2011) images. Depending on the specific goal of each study, different approaches might be needed. All of the suggested models try to measure the disorder or uncertainty of pixel values in an image (or bits of data in general). Almost all of them are inspired, one way or another, by the definition of entropy introduced by Shannon (2001) in the Information Theory domain. Despite the valuable achievements in this direction, the Monkey Model Entropy (MME) (Justice 1986; Skilling 1989), which is identical to what Shannon introduced for decoding communication bits, is still the most popular model in the image processing community. In this model, the random variable i_{x,y}, i.e., the intensity value of the pixel at position (x, y), is assumed to be independent and identically distributed (i.i.d.), and therefore the entropy is measured as follows:

    entropy_{MME} = -\sum_{i=0}^{L} p(i) \cdot \log_2(p(i))    (1)

where p is the probability distribution function of the pixel intensity value i, and L is the number of gray levels minus one (e.g., 255 for a typical 8-bit quantized image). This can be computed directly from the intensity-based histogram of an image. As an intuitive interpretation of this parameter, one could say that an image with low entropy is more homogeneous than one with higher entropy.
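A direct sketch of Eq. 1 (our own; the function name and the explicit `bins`/`value_range` arguments are ours). It also makes visible that the result depends on the histogram bin size, the choice that, as noted earlier, requires tuning:

```python
import numpy as np

def mme_entropy(image, bins, value_range=None):
    """Monkey Model Entropy (Eq. 1): pixel intensities are treated as
    i.i.d. draws from the normalized intensity histogram."""
    counts, _ = np.histogram(image, bins=bins, range=value_range)
    p = counts / counts.sum()
    p = p[p > 0]                      # by convention, 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))
```

A perfectly flat histogram over 256 bins gives the maximum of log2(256) = 8 bits, while a constant image gives 0.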
This model of entropy was utilized previously by Banda et al. as one of the ten selected image parameters in their research (Banda & Angryk 2010a). It is worth noting that we are aware of the fact that the assumption of i.i.d. pixel intensities disregards the presence of spatial order or contextual dependency of the image pixels; however, the segmentation step discussed above provides some compensation for this loss of spatial information. In addition, the simplicity of this model is in line with the previously discussed focus on prioritizing the computation cost of the parameter choices. The MME is indeed the simplest model and can be computed faster than other approaches, for instance, those which require the computation of the joint probability distribution function of the pixel values (Razlighi & Kehtarnavaz 2009).

[3] Note that in Table 1, in order to have a unified formulation for different parameters, whenever possible we used the histogram function (i.e., h(i)) to formulate the parameter; however, it is only for two parameters, namely uniformity and entropy, that the calculation of the normalized histogram (i.e., p(i)) is necessary.

2.1.2. Uniformity

Similar to entropy, uniformity is also a popular statistical measure that is widely used to quantify the randomness of the color intensities and to characterize the textural properties of an image. Uniformity is calculated as:

    uniformity = \sum_{i=0}^{L} p^2(i)    (2)

and reaches its highest value when the gray-level distribution has either a constant or a periodic form (Davis et al. 1979). In this formula, the variables p, i, and L are similar to those in Eq. 1, where p is the probability distribution function of the pixel intensity value i, and L is the number of gray levels minus one.

2.1.3. Fractal Dimension

Fractal dimension is another well-known measure utilized by scientists of different domains.
However, unlike the parameters discussed so far, which are purely statistical measures, fractal dimension (and Tamura directionality) focus more on the textural aspects that we believe are of particular importance for the distinction of at least some of the solar phenomena, such as active regions and coronal holes. Whenever it comes to analyzing scientific image data, this parameter seems to be a useful choice. In solar physics, as a relevant example, fractal dimension was used for a variety of purposes, including the detection of active regions (Revathy et al. 2005), and to exhibit the fractal scaling of solar flares in EUV wavelength channels (Aschwanden & Aschwanden 2008).

Historically, fractal dimension was once used as a clever solution to a problem that is now known as the coastline paradox (Weisstein 2008). It was the idea of measuring the length of the coast of Britain, independent of the scale of measurement (Mandelbrot 1967), that provided the basis for the definition of this parameter. Fractal dimension is a measure of nonlinear growth, which reflects the degree of irregularity over multiple scales. In other words, it measures the complexity of fractal-like shapes or regions. A larger dimension indicates a more complex pattern, while a smaller quantity suggests a smoother and less noisy structure. Among the several different methods for measuring the fractal dimension (Annadhason 2012), the box counting method, also known as the Minkowski-Bouligand dimension, is the most popular one.

The general approach for the box counting method can be described as follows. The fractal surface, in an n-dimensional space, is first partitioned with a grid of n-cubes with the side length of \varepsilon. Then, N(\varepsilon) is used to denote the number of n-cubes overlapping with the fractal structure.
The counting process is then repeated for n-cubes of different sizes, and the slope \beta of the regression line fitting the plot of \log(N(\varepsilon)) against \log(1/\varepsilon) gives the dimension of this fractal (cf. the formula in Table 1). In a 2-D space such as ours, the n-cubes are simply squares with a side length of \varepsilon. More details on employing this parameter for measuring the complexity of solar events are discussed in Section 4.

2.1.4. Tamura Directionality

Directionality as a texture parameter is a well-known concept in the image processing and texture analysis domains. This parameter was extensively investigated by Bajcsy (1973) and later on by Tamura et al. (1978). The method proposed by Tamura to measure directionality has become a popular texture parameter and has been used in a variety of studies. Well-known examples are the QBIC (Flickner et al. 1995) and Photobook (Flickner et al. 1995) projects, which are content-based image retrieval (CBIR) systems. Some more domain-specific examples would be the solar image data benchmark gathered by Schuh & Angryk (2014) and the tracking of solar events by Kempton & Angryk (2015). In addition to Banda's work (Banda & Angryk 2010a) on evaluating the effectiveness of Tamura directionality on AIA solar images, Islam et al. (2008), a discipline-independent study, showed that directionality is indeed one of the most important texture features when human perception is considered the ground truth. Tamura directionality is a measurement of the changes in directions visually perceivable in image textures.
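Returning to the box-counting method, a minimal sketch follows (our own illustrative implementation, not the one used to produce the dataset). It assumes a binary mask marking the pixels that belong to the structure of interest and, per the Table 1 formula, takes the dimension as the slope of log N(\varepsilon) regressed on log(1/\varepsilon):

```python
import numpy as np

def box_counting_dimension(mask, sizes=(1, 2, 4, 8, 16, 32)):
    """Estimate the Minkowski-Bouligand dimension of a binary 2-D mask."""
    counts = []
    for eps in sizes:
        # trim so that the grid of eps x eps boxes tiles the mask exactly
        h = (mask.shape[0] // eps) * eps
        w = (mask.shape[1] // eps) * eps
        trimmed = mask[:h, :w]
        # a box is "occupied" if any pixel inside it belongs to the structure
        boxes = trimmed.reshape(h // eps, eps, w // eps, eps).any(axis=(1, 3))
        counts.append(boxes.sum())
    # dimension = slope of log N(eps) vs. log(1/eps)
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(sizes)), np.log(counts), 1)
    return slope
```

As a sanity check, a completely filled mask yields a dimension of 2, and a one-pixel-wide straight line yields 1, as expected.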
Tamura formulated this parameter as follows:

    T_{dir} = 1 - r \cdot n_p \cdot \sum_{p}^{n_p} \sum_{\phi \in \omega_p} (\phi - \phi_p)^2 \cdot h(\phi)    (3)

where:
    p: a peak's index,
    n_p: the total number of peaks,
    \phi_p: the angle corresponding to the p-th peak,
    \omega_p: a neighborhood of angles around the p-th peak,
    r: the normalizing factor for the quantization level of \phi,
    \phi: the quantized direction code (cyclically in modulo 180°).

In statistical terms, this parameter calculates the weighted variance of the gradient angles, \phi, for each peak, p, of the histogram of angles, h(\phi), within each peak's domain, \omega_p, considering the angle corresponding to each peak to be the mean value of the angles within that peak's domain. It then aggregates across the identified peaks and, after re-scaling the result to the range [0, 1], subtracts the final value from one to achieve a monotonically increasing function. That is, it returns greater quantities for a more directional texture.

3. DATA SOURCES

Tuning the calculation of the image parameters to achieve an effective set of features requires an evaluation process. The evaluation process we utilize relies on reported solar events to evaluate the performance of each image parameter individually, for each wavelength channel we are utilizing. In order to accomplish this, we use supervised learning to measure the performance of each of the image parameters in detecting some of the solar events. In this section, we detail the data sources for our images and the event-related metadata that was collected. We also briefly explain the FITS format, a commonly used format in astronomy that is employed by the SDO repository as the primary way of digitizing the AIA images. Understanding the structure of this format and how the AIA images are stored in it is crucial for our preprocessing steps.

3.1.
HEK: Event Data

The Heliophysics Event Knowledgebase (HEK) is the source of the spatiotemporal data used in this study. The HEK system, as a centralized archive of solar event reports, is populated with the events detected by its Event Detection System (EDS) from SDO data. Eighteen different classes of events are considered, such as active region, coronal hole, and flare. For each event class, a unique set of required and optional attributes is defined. Each event must have a duration and a bounding box that contains the event in space and time. We use this information to map the metadata of the reported events to the corresponding AIA images.

For the evaluation of image parameters performed in this study, we utilize two of the reported solar event types: active region and coronal hole. There are multiple reporting sources for active regions that are reported to HEK, and those reported by the Space Weather Prediction Center (SWPC) of NOAA (National Oceanic and Atmospheric Administration) are assigned numbers daily. A NOAA active region observation, as Hurlburt et al. (2010) explain, is an event bounded within a 24-hour time interval, and therefore HEK considers all NOAA active regions with the same active region number to be the same active region. However, there is a second automated module from the Feature Finding Team that reports both active regions and coronal holes, described by Verbeeck et al. (2014) and called the SPoCA module, which reports detections every four hours. It is the reports from this module that are utilized as the solar events of interest in this study.

3.2.
SDO: AIA Image Data

The Atmospheric Imaging Assembly (AIA) has four telescopes that provide narrow-band imaging of seven extreme ultraviolet (EUV) band passes (94 Å, 131 Å, 171 Å, 193 Å, 211 Å, 304 Å, and 335 Å) and two UV channels (1600 Å and 1700 Å) (Lemen et al. 2012). The captured 4k images of the Sun, which are full-disk snapshots with a cadence of 12 seconds, are compressed on board and, without being recorded on orbit, are transmitted to SDO ground stations. The received raw data (Level 0) are archived on magnetic tapes in the JSOC science-data processing facility. The uncompressed data are then exported as FITS files with the data represented as 32-bit floating-point values. At this point, images are already calibrated; however, some corrections and cleaning are still required due to the existence of a small residual roll angle between the four AIA telescopes. At this stage (Level 1.5), the data is ready for analysis. In some repositories, including Helioviewer, the FITS files are converted to the JP2 format to reduce the volume of their databases. In this study, we use the Level 1.5 (in short, L1.5) FITS files and the JP2 images to achieve a comparative analysis. In the following subsections, we elaborate more on how FITS files are different from JP2 images and why a fair comparison should take into account the differences in the distribution of pixel intensities in these two image formats.

3.2.1. AIA Images in FITS

FITS, short for Flexible Image Transport System, is a data format for recording digital images of scientific observations. This format was proposed as a solution to the data transport problem. For details on the FITS format, we refer the interested reader to Wells & Greisen (1979). Here, we only mention a few key points about this format to provide the basic knowledge needed for understanding the preprocessing steps that will be discussed later.
For processing of the FITS files we use the nom-tam-fits[4] Java library. A FITS file consists of a header, where the basic and optional metadata are stored, and immediately following that, the data matrix representing the image starts. In the header of AIA images, a plethora of information is stored (Nightingale 2011) that might be useful for different purposes, such as the minimum and maximum color intensities, the date of creation of the file, the exposure time of the CCD detectors of the AIA instrument, the name of the telescope (e.g., SDO/AIA) and the instrument (e.g., AIA), the wavelength in units of Ångströms (e.g., 94 Å), several descriptive statistics about the captured intensities, the radius of the Sun in pixels on the CCD detectors, and so on.

It is important to note that, unlike the typical 8-bit quantized image formats such as JP2, JPG, or PNG, which are limited to 256 different intensity levels, the intensity level in the FITS format is only bound by the sensitivity of the sensors of the camera. Since the AIA cameras use a 14-bit analog-to-digital converter (ADC) to translate the charge read out from each pixel to a digital number value (Boerner et al. 2011), the FITS color intensity value has an upper bound of 16384 (i.e., 2^14). Such a level of precision comes at the cost of introducing a significant degree of skewness in the distribution of intensities. In the next section, this will be discussed in greater detail.

3.2.2. Distribution of Pixel Intensities

Since in this study we run all of our experiments on both JP2 and FITS images, it is important to have a good understanding of the distribution of pixel intensities in these two formats, and of their differences and similarities. We begin the discussion with the theoretical pixel intensity extrema in FITS files, i.e., 0 and 16383.
For instance, in the FITS format, pixels with the maximum brightness do not appear as frequently as they do in the JP2 images. This is, of course, the result of the lossy JP2 compression, which transforms the pixel intensity domain of the FITS file into the much narrower range of 0 to 255. However, these extreme values are very likely to appear in FITS images, in the bright regions caused by strong flares. At the other extreme, FITS images may contain some negative values, which appear to be a byproduct of the post-processing data transformation (Level 0 to Level 1.5), since the CCD detectors are not capable of recording negative values. As a pre-processing step, we replace all negative values with zeros in order to clean the data. It is interesting to note that such extreme skewness in the distribution of pixel intensities is not limited to a specific wavelength channel, and holds true across all EUV and UV channels.

(4) Library: http://nom-tam-fits.github.io/nom-tam-fits/

Figure 3. A 3-D view of an AIA FITS image from the 171 Å channel, with values ranging from 0 to 16383.

Next, we would like to learn about the contribution of the extreme values to the distribution of pixel intensities. In particular, we are interested in knowing the percentage of pixels in each image that carry such extreme values. To answer this question, we studied one month's worth of AIA FITS images, from 2010-09-01 through 2010-09-30, with a cadence of 2 hours, from 9 wavelength channels (excluding the visible wavelength, 4500 Å), which sums up to a total of 3240 images. In Fig. 13, the p-th percentile of the observed intensities for each of the images within this period is shown.
The maximum values in these plots should be compared against the maximum intensity reached during this period, which is the theoretical maximum, i.e., 16383, for all 9 wavelength channels. By looking at the spike in the first plot (i.e., wavelength 94 Å), we can see that 99.5% of the pixels in the corresponding image had color intensities less than 44, while pixels as bright as 16383 existed in that very image. Such significant gaps between the mean values of the distributions and the maxima are summarized in Table 2.

Figure 4. Distribution of pixel intensities in a FITS image (A), a clipped FITS image (B), and a similar image in JP2 format (C). The illustration shows how clipping the raw FITS image can reveal the hidden shape of the bimodal distribution, which is not visible in (A) due to the large number of bins.

The above statistical analysis suggests an extreme skewness in the distribution of pixel intensities in FITS images. This is illustrated in plot A of Fig. 4. The visual effect of such skewness is "underexposure". In other words, if the pixel values of a FITS image are (linearly) transformed to the range of 8-bit images (i.e., [0, 255]), the output is mostly black, with few to no small, extremely bright regions. It is important to note that our image parameters, which are utilized in supervised machine learning models to distinguish the different solar phenomena, are pixel-based features. That is, the relative differences between the pixels' brightness are taken into account, and not their absolute values. Therefore, providing the classifiers with the original L1.5 AIA data containing such far-out values, without treating the outliers appropriately, could bias the fit estimates and distort the classification results.
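The percentile study above can be reproduced in miniature. The sketch below assumes NumPy and substitutes a synthetic right-skewed array for a real AIA FITS image; it only illustrates how the per-image percentiles are gathered, not the authors' pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for one AIA FITS image: a heavily right-skewed
# intensity distribution capped at the 14-bit maximum of 16383.
img = np.minimum(rng.exponential(scale=8.0, size=(1024, 1024)), 16383)

percentiles = {p: np.percentile(img, p) for p in (80, 90, 95, 99, 99.5)}
for p, v in percentiles.items():
    print(f"{p}-th percentile: {v:.1f}")
print("max:", img.max())
```

Even with this toy distribution, the 99.5-th percentile lands far below the maximum, mimicking on a small scale the gap reported for the 94 Å channel.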
We provide more details on how this issue is addressed in the next section.

3.3. FITS, Clipped FITS, and JP2

In this section, we explain how we preprocess the FITS files prior to the feature extraction and classification tasks. It is worth noting that, since these preprocessing steps introduce some changes to the pixel values of the original L1.5 FITS files, for the sake of completeness of our later comparisons and to avoid any bias in our study, we extend our experiments to cover three data types: JP2, L1.5 FITS, and clipped FITS, as defined in the following sections.

3.3.1. Clipping FITS Images

Treating outliers is a common practice when cleaning data for any machine learning task, as they may introduce a significant bias into the learning process and hence reduce the effectiveness of the extracted features for the classification goal. In the case of outliers being extreme values, the general approaches are: a) removal of the outliers, b) replacing them with some statistic (imputation), c) altering them to expected extrema (capping), and d) predicting their "expected" values based on the local changes of the intensities. Of course, removing the outliers and re-scaling the values into the quantized range of 8-bit values would leave us with results not much different from the existing JP2 images. This would void our attempt to study the potential differences in the analysis of FITS versus JP2 images.

Table 2. Maximum percentiles of the pixel intensities of AIA FITS images, observed from 9 wavelength channels, for the period of 2010-09-01 to 2010-09-30, with a cadence of 2 hours.

  W        80-th   90-th   95-th   99-th   99.5-th     Max
  94 Å         7      10      15      34        44   16383
  131 Å       19      30      43      88       123   16383
  171 Å      568     777    1034    1935      2602   16383
  193 Å      574     904    1354    2884      3968   16383
  211 Å      154     258     429    1159      1673   16383
  304 Å      116     151     188     327       431   16383
  335 Å       16      26      43     171       305   16383
  1600 Å     196     242     289     427       509   16046
  1700 Å    1801    2205    2558    3517      4138   16215

So, instead of removing the outliers altogether, we employ the capping approach, also known as clipping when applied to images. The process involves finding a global cutting point on the skewed tail of the probability distribution function and shifting all pixel intensities above this threshold down to that point. By "global" cutting points, we mean thresholds that are fixed across all AIA images for each wavelength channel. This ensures that the clipping filter affects all of the images uniformly. The result of such a data transformation is that, while no data points are removed (they are only shifted to the cutting point), the extreme skewness of the distribution is slightly mitigated. We use the maximum of the 99.5-th percentiles of pixel intensities as the global cutting point for each wavelength. That is, in the worst-case scenario, 0.5% of the observed pixel intensities will be shifted to the new maximum point. The chosen cutting point for each wavelength is highlighted in Table 2.

3.3.2. Pixel Intensity Transformation

After using the statistically derived cut-off points for capping outlier pixel values, an additional processing step is to re-scale the now-capped values. Note that, after clipping the FITS images, although the distribution of pixel intensities is now more naturally skewed, it does not match the distribution of the pixels in JP2 images. This is due to the non-linear transformation of the data in the conversion from FITS to the JP2 format.
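The capping step of Section 3.3.1 can be sketched as follows. This is a NumPy illustration, with the cut points taken from Table 2; the dictionary layout and function name are ours, not the authors' code:

```python
import numpy as np

# Global per-wavelength cut points: the maxima of the observed
# 99.5-th percentiles reported in Table 2.
CUT_POINTS = {94: 44, 131: 123, 171: 2602, 193: 3968, 211: 1673,
              304: 431, 335: 305, 1600: 509, 1700: 4138}

def clip_fits(img, wavelength):
    """Replace negative values with zero, then shift all intensities
    above the global cut point down to that point (capping/clipping)."""
    img = np.maximum(img, 0)                     # negatives -> 0
    return np.minimum(img, CUT_POINTS[wavelength])
```

No pixels are removed by this operation; values beyond the cut point are only moved onto it, exactly as the capping approach prescribes.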
This transformation is done by Helioviewer's JP2GEN project (5). The transformation model, as well as the choice of cut-off points, is primarily based on what the AIA project recommended at the time and on how the Helioviewer project team wanted the images to look. Since applying a transform function does not introduce a loss of information in the data, and to ensure that the two sets of distributions are similar in shape, we apply the same data transformation functions that are used in the JP2GEN module. The transformation method differs depending on the wavelength channel of the image: a linear transformation is used for 1700 Å images, a square root transformation for images from 171 Å, and a logarithm transformation for the remaining channels. Note that this is a bijection (t: N → R) and no data points are removed. The result of such a transformation is illustrated in Fig. 4 on a sample AIA image. It compares the distribution of pixel intensities in a FITS image before clipping (A), after clipping and transformation (B), and the one derived from the corresponding JP2 image (C). Such a comparison shows how the hidden bimodal shape of the distribution is perfectly restored after clipping and transformation. This verifies both the correctness and the importance of this step for an unbiased comparison of the different image types. In addition, a 3-D model of the same AIA image in JP2, and in FITS both before and after clipping and transformation, is illustrated in Fig. 5. In these visualizations, the spikes (representing the magnitude of brightness) reach their highest values at 16383, 51 (i.e., ≈ √2602), and 255, in FITS, clipped FITS, and JP2, respectively. From this point on, we refer to the un-clipped FITS images as L1.5 FITS, and to the clipped and transformed FITS as the clipped FITS.
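A simplified version of these per-channel re-scalings can be sketched as below. The exact constants used by JP2GEN are not reproduced here; in particular, the log1p choice and the final min-max mapping onto [0, 255] are assumptions of this sketch:

```python
import numpy as np

def to_byte_scale(clipped, wavelength):
    """Apply the per-channel transformation (linear for 1700 A, square
    root for 171 A, logarithm otherwise) and map onto the 8-bit range."""
    x = np.asarray(clipped, dtype=float)
    if wavelength == 1700:
        y = x                       # linear
    elif wavelength == 171:
        y = np.sqrt(x)              # square root
    else:
        y = np.log1p(x)             # log(1 + x), avoiding log(0)
    # monotone (order-preserving) mapping onto [0, 255]
    return np.round(255.0 * y / y.max()).astype(np.uint8)
```

For a 171 Å patch clipped at 2602, the square-root path tops out near √2602 ≈ 51 before the 8-bit mapping, matching the spike heights quoted for Fig. 5.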
In this preprocessing step, before clipping the extreme far-out values, we also take into account the exposure time of the CCD detectors of the AIA instrument for each image. We normalize the pixel intensities by the specific exposure time with which the image was captured. This is important since it provides a uniform brightness across our image collection. These values are stored in the header section of each image, under the keyword 'EXPTIME', as double-precision floating-point values in seconds (Nightingale 2011).

In summary, we analyze the AIA images in three different formats: L1.5 FITS (as archived in JSOC), clipped FITS, and JP2 (as provided by the Helioviewer API).

(5) JP2GEN: https://github.com/Helioviewer-Project/jp2gen

Figure 5. 3-D views of an AIA image from the 171 Å channel, in different formats: (a) FITS, with maximum value at 16383; (b) clipped FITS, with maximum value at 51; (c) JP2, with maximum value at 255. The z-axis represents the pixel intensities. Notice that, due to the extremely large spikes in the raw FITS image, the full-size model for the raw FITS image, with the same proportions used in the other two, could not fit here. An un-cut version of this model can be seen in Fig. 3.

The L1.5 FITS and JP2 images lie at the two extreme ends of the pre-processing path. L1.5 FITS images are only pre-processed in JSOC, for cleaning and calibration in the process of digitizing the images, and are relatively large files (varying from ≈ 5 to ≈ 14 MB). JP2 images, by contrast, are fully pre-processed and compressed (down to ≈ 1 MB) into a typical 8-bit quantized image format. Clipped FITS images lie somewhere in between: they do not have the extreme far-out intensities that the L1.5 FITS images do, but at the same time, they are not limited to 255 gray levels as JP2 images are.
As mentioned before, we use all three of these image types to evaluate our image parameters in Section 5.

4. SETTINGS OF IMAGE PARAMETERS

Now that we have studied our data types and the image parameters to be tuned, we need to identify the variables in each image parameter that can determine the performance of that parameter. In this section, we provide more information about each of the four image parameters, along with the implementation details of their computation that allow for the tuning of specific variables and their domains.

4.1. Entropy and Uniformity

As discussed in Section 2.1.2, the entropy and uniformity parameters depend solely on the normalized histogram of the image color intensities. It is in the nature of histograms that different choices of bin size result in different levels of smoothing of the histogram. In other words, p in Eqs. 1 and 2, which is the probability density function of the random variable i, is defined differently for different bin sizes. Although there are several general rules for determining the bin size, such as Sturges' formula (Sturges 1926) or Scott's rule (Scott 1979), often the best choice is one that is data driven and can be verified against the target classes of the data. So, for these two parameters, the optimal bin size is the variable that will be tuned in the experiments described in Section 5. The optimal value of the variable is independently evaluated for each image wavelength, and a set of these values is obtained through the experimental evaluation, one for each wavelength included in the resultant dataset. The domain set for this variable is (0, I) ⊂ N or R, depending on the image type, where I is the maximum color intensity for the image type under study.
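With the bin count as the tunable variable and I as the intensity ceiling, the two histogram-based parameters can be sketched as follows. This is a NumPy illustration; the function name is ours, and the definitions follow the standard histogram-based forms that Eqs. 1 and 2 refer to:

```python
import numpy as np

def entropy_uniformity(patch, n_bins, max_intensity):
    """Histogram-based entropy and uniformity for one image patch.

    n_bins is the tunable variable; max_intensity is, e.g., 255 for JP2
    or 16383 for L1.5 FITS.
    """
    hist, _ = np.histogram(patch, bins=n_bins, range=(0, max_intensity))
    p = hist / hist.sum()              # normalized histogram (PDF of i)
    nz = p[p > 0]                      # convention: 0 * log(0) = 0
    entropy = -np.sum(nz * np.log2(nz))
    uniformity = np.sum(p ** 2)
    return entropy, uniformity
```

A perfectly flat patch gives entropy 0 and uniformity 1; spreading mass over more occupied bins raises entropy and lowers uniformity, which is why the bin count directly shapes both parameters.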
For example, the domain set for this variable on the JP2 images from Helioviewer is [0, 255] ⊂ Z, whereas the domain set for L1.5 FITS is [0, 16383] ⊂ Z.

4.2. Fractal Dimension

Earlier, in Section 2.1.3, we explained how the fractal dimension parameter utilizes the box counting method to measure the dimension of fractal-like shapes. However, there are a number of different decisions in the implementation of this method that can affect the values it produces. For instance, the choice of edge detection algorithm, and the values used for the variables of each of the different algorithms, will produce differing results. In the following sections, we explain how this method is applied to AIA images, and which variables need tuning in our experimental evaluations of Section 5.

4.2.1. Box Counting on AIA Images

To compute the fractal dimension image parameter, we first need to know how the box counting method discussed before can be applied to AIA images. Let us assume that an edge detection algorithm has been chosen and that the appropriate settings have been found for it. We can then apply the edge detection algorithm to an AIA image and treat the detected edges as the fractals' contour whose dimension is to be measured. Then, for each ε (the box's side length) from a predefined domain, we count the number of grid cells that overlap with an edge. Considering the resultant pairs, ⟨ε, N(ε)⟩, as a set of points in the 2-D space of box sizes and box counts, the slope β of the fitted regression line can then be measured. β is the fractal dimension corresponding to this region. Since the patch size of our image segmentation discussed before is 64 × 64 pixels, the box size in the above procedure has an upper bound of 64 pixels.
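The box counting procedure just described can be sketched directly in NumPy, using the powers of two up to the 64-pixel patch size as the box-size domain (the function name and loop structure are ours, not the paper's implementation):

```python
import numpy as np

def fractal_dimension(edges, sizes=(2, 4, 8, 16, 32, 64)):
    """Box-counting dimension of a binary edge map (e.g., a 64x64 patch).

    For each box side length eps, count the grid cells that overlap at
    least one edge pixel; the slope of log N(eps) vs. log(1/eps) is the
    fractal dimension beta.
    """
    n = edges.shape[0]
    counts = []
    for eps in sizes:
        c = 0
        for i in range(0, n, eps):
            for j in range(0, n, eps):
                if edges[i:i + eps, j:j + eps].any():
                    c += 1
        counts.append(c)
    log_inv_eps = -np.log(np.asarray(sizes, dtype=float))
    log_counts = np.log(np.maximum(counts, 1))
    beta = np.polyfit(log_inv_eps, log_counts, 1)[0]   # regression slope
    return beta
```

As a sanity check, a straight line through the patch yields a dimension of 1, and a completely filled patch yields 2, the two extremes between which edge contours fall.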
To have a natural sequence of side lengths for these boxes, we use the set of all powers of two within this range, i.e., {2, 4, 8, 16, 32, 64}, as the domain of the box side length.

Fractal dimension provides a measure to quantify the complexity of a shape's contour, with larger values indicating higher complexity. In Fig. 6, we show how the complexity of a shape's contour affects the fractal dimension value, using two groups of test signals that are generated to mimic fractal-like shapes. One set is created by adding incrementally increasing random noise to a sine wave, and the other by adding an incrementally increasing frequency of another sine wave to the base sine wave. Measuring the dimension of each signal, a roughly linear growth of the fractal dimension is observed, which conforms to our expectation.

Figure 6. An experiment that shows the growth of fractal dimension on a series of sine waves in two different situations: a) with an iterative increase of random noise added to the signal, and b) with an iterative increase of the frequency of another sine wave added to the signal. The results confirm the sensitivity of this parameter to the complexity of the shapes' contour.

4.2.2. Edge Detectors

The brief explanation of the box counting method tells us that the effectiveness of the fractal dimension parameter in describing the textural features of an image relies on the quality of the edge detection method that provides the fractal-like shapes. That is, a noisy input, as well as an overly smoothed image, may render this parameter completely ineffective. This fact is the motivation for the following survey of existing edge detection methods and their performance on AIA images.
Note that, for this application, both the quality of the detected edges that are to be the input to the box counting method, and the execution time of each of the edge detection methods, are important, as a longer execution time requires more computational resources for the near-real-time constraint to be met. Among the existing edge detection methods, we choose the Sobel (Sobel 1990), Prewitt (Prewitt 1970), and Roberts Cross (Roberts 1963) edge detectors as the classical candidates, Canny's edge detector (Canny 1986) as a popular, modern method, and SUSAN (Smith & Brady 1997) as a less popular but more recent approach. It has been shown in several comparative analyses (Maini & Aggarwal 2009; Heath et al. 1996; Sharifi et al. 2002) that the Canny edge detection algorithm performs better than all of its ancestors in most scenarios, especially on noisy images. Given the special noisy nature of the AIA solar images, with layers of noisy textures instead of solid foreground objects and background landscapes, the classical methods are likely to fail. That being said, we do not wish to simply rely on general knowledge about the performance of these methods on textural inputs. Instead, we apply these filters to AIA images and compare the quality of the detected edges that are to be the input to the box counting method.

The first three edge detection methods, Sobel, Prewitt, and Roberts Cross, are relatively simple algorithms. They each begin by estimating the first derivative of the image with their corresponding gradient operators (masks). Then, since the magnitude of the gradient vectors does not give thin and clear edges, non-maximum suppression is also applied (as is done in Canny) to eliminate the multiple representations of each edge. The results of the Sobel, Prewitt, and Roberts Cross methods can be seen in Figures 7b, 7c, and 7d, respectively.
Canny edge detection, however, is more complicated, and starts with a prior smoothing step using a 5 × 5 Gaussian kernel. This mitigates the effect of noise on the calculation of the gradient. Then, using a 3 × 3 Sobel operator, the gradient of each pixel, g = (g_x, g_y), which is a vector with magnitude √(g_x² + g_y²) and orientation arctan(g_y/g_x), is calculated. Each pixel has eight adjacent neighbors, allowing eight different angles for the edge passing through that pixel. Since only the orientation of the edges matters (and not the direction), the choices are limited to four. Therefore, the continuous range of the calculated angles is quantized and mapped to one of the following choices: 0°, 45°, 90°, or 135°. This is followed by a thinning process of the edges (i.e., non-maximum suppression), which eliminates the pixels that are labeled as edges but whose locations are not in line with the calculated orientation of the edges. At the end, hysteresis thresholding cleans up the disconnectedness of the edges using two thresholds: a low threshold, lt, and a high threshold, ht. Any pixel with gradient magnitude greater than ht is labeled as an edge, and as a non-edge if its magnitude is less than lt. Pixels with magnitudes between lt and ht are considered part of an edge if and only if they are connected to a pixel that is already labeled as an edge. This last step, next to the initial smoothing step, makes the Canny edge detector an expensive filter, but this cost pays off by producing fewer broken edges and a less noisy output.

The SUSAN edge detector, on the other hand, adopts a very different approach by not using any image derivatives, which makes it a good candidate for noisy images like ours. This is the very reason for including it in our list, despite its computational cost.
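Returning to Canny's final step: the hysteresis thresholding described above can be sketched as an iterative region-growing pass. This is our own simplified NumPy illustration with 8-connected neighborhoods, operating on a gradient-magnitude array; real Canny implementations interleave it with edge tracking:

```python
import numpy as np

def hysteresis(mag, lt, ht, max_iter=1000):
    """Label pixels above ht as edges; pixels in [lt, ht] become edges
    only if (transitively) 8-connected to an existing edge pixel."""
    strong = mag > ht
    weak = (mag >= lt) & ~strong
    edges = strong.copy()
    for _ in range(max_iter):
        grown = edges.copy()
        # propagate edge labels to all 8 neighbors via array shifts
        grown[1:, :] |= edges[:-1, :];   grown[:-1, :] |= edges[1:, :]
        grown[:, 1:] |= edges[:, :-1];   grown[:, :-1] |= edges[:, 1:]
        grown[1:, 1:] |= edges[:-1, :-1]; grown[:-1, :-1] |= edges[1:, 1:]
        grown[1:, :-1] |= edges[:-1, 1:]; grown[:-1, 1:] |= edges[1:, :-1]
        new_edges = edges | (grown & weak)
        if np.array_equal(new_edges, edges):
            break                        # converged: no weak pixel added
        edges = new_edges
    return edges
```

Weak pixels that never touch a chain anchored at a strong pixel are discarded, which is what suppresses isolated noise responses between lt and ht.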
This edge detector is built on a core concept called the Univalue Segment Assimilating Nucleus (in short, USAN), which is the central point (nucleus) of a circular mask, and on the SUSAN principle, stated as follows: "An image processed to give as output inverted USAN area has edges and two dimensional features strongly enhanced, with the two dimensional features more strongly enhanced than edges". The intensity of the nucleus and the second moment of the area of the USAN masks are used to find the edge directions. Eventually, similarly to Canny, non-maximum suppression is applied to clean up the edges. In this study, we use the implementation of this method from the OpenIMAJ library (Hare et al. 2011).

To compare the quality of these edge detectors, we fed each of these methods a variety of AIA images, varying in the queried time of the solar events, the wavelength channels, and the appearing event types. Fig. 7 illustrates one of the visual comparisons: a cut-out of an active region instance observed on March 7, 2012 from the 171 Å channel, and the output of each of the above-mentioned edge detectors.

Figure 7. A cut-out of an active region instance observed on March 7, 2012 at 00:24:14:12 UT from the 171 Å channel, as well as the outputs of the different edge detection methods. In (a), the relative sizes of the boxes (i.e., 64, 32, 16, 8, 4, and 2 pixels) used in the box counting method are also illustrated.

As is visible in this comparison, the Canny edge detector provides much cleaner edges and maintains the orientation of the coronal loops (along which electrified plasma flows) of the flaring region, whereas the others barely distinguish the texture caused by the powerful magnetic fields from the more quiet (darker) areas.
Given that the detected edges are to be passed to the box counting method with box sizes as large as those shown in Fig. 7a, it is visually convincing that, for the Sobel-like methods (i.e., Sobel, Prewitt, and Roberts), such a uniform distribution of extremely short and broken edges does not lead to a reliable measure of the dimension corresponding to different regions. As for SUSAN's output (see Fig. 7e), although the results are very different from the others, it does not seem to be a good choice for noisy textures, as it does very little in identifying the visible edges.

Another argument in favor of the Canny edge detector is the tunability of this method, which is possible by adjusting its three variables: the standard deviation of the Gaussian smoothing (σ), and the lower (lt) and higher (ht) thresholds, as discussed in Section 4.2.2.

Table 3. The average execution time of different edge detection methods on 4096 × 4096-pixel AIA images.

  Method       Execution Time (sec.)
  1  Sobel       2.267
  2  Prewitt     2.208
  3  Roberts     1.809
  4  SUSAN       0.674
  5  Canny       3.619

In Fig. 8, the effect of such tuning on the same sample active region used before is shown. Note the smooth decrease in the noise level as σ increases, while the general patterns and directions are maintained.

Regarding the running time of these methods, Table 3 summarizes our comparisons. Although the execution time of the utilized methods is an important factor in general, in this case it does not seem that there are many choices left for us except the relatively most expensive one, i.e., the Canny edge detector. This is because only this method produces relevant input for the box counting method of the fractal dimension parameter.
The decision is between a faster method that mostly produces uniform noise, and a relatively more expensive one that provides the right input for fractal dimension, in which physical characteristics, such as the coronal loops marking the curving lines of powerful magnetic fields, are enhanced.

The results listed in Table 3 are the average execution times measured by running each of the algorithms on a group of 100 full-disk AIA images of size 4096 × 4096 pixels, in 10 different wavelength channels, containing different event types. To put the numbers in context, it is worth noting that these experiments were conducted on a Linux machine with a Core i5-6200U CPU (2.30 GHz × 4) and 8 GB of memory, while for any operational task, a much more powerful machine would be used to process the images. Therefore, the running time of the Canny edge detector is expected to be less than 3.619 seconds per image.

Having chosen the Canny edge detector as the method to filter the input AIA images and pass them to the box counting method, the tuning of the fractal dimension parameter then depends on the choices of lt, ht, and σ for the edge detector. Our experiments show that, by changing σ while keeping lt and ht fixed at a narrow interval close to zero (e.g., lt = 0.02 and ht = 0.08), we could cover almost the entire spectrum of the possible outputs. This observation leaves only one variable, σ, for the tuning of this image parameter.

Figure 8. The Canny edge detector on an active region instance, with lt = 0.02, ht = 0.08, and σ varying from 1 to 6, starting at the top-left image and ending at the bottom-right.

4.3. Tamura Directionality

The general formula for computing the directionality parameter was explained in Section 2.1.4. As it calculates the weighted variance of the gradient angles, it requires the gradient of the image to be calculated beforehand.
For an image I, the gradient vector is:

    ∇I = [g_x, g_y] = [∂I/∂x, ∂I/∂y]    (4)

from which the direction and magnitude of the vectors can be calculated as follows:

    [φ, r] = [arctan(g_y / g_x), √(g_x² + g_y²)]    (5)

There are different kernel convolution matrices used to approximate the gradient vector of an image. Since no preprocessing such as smoothing is required for this task, their computation time depends only on the kernel size. Therefore, we limit our choices to the simple but well-known gradient operators, such as Sobel-Feldman (Sobel 1990), Prewitt (Prewitt 1970), and Roberts Cross (Roberts 1963). The last one has a 2 × 2 kernel matrix that makes it slightly faster but more sensitive to noise, due to its smaller kernel matrix compared to the 3 × 3 matrices of the other two. After visually studying the remaining two kernels, we observed that both their gradient outputs and the histograms of angles are fairly similar. Therefore, we decided to utilize Sobel-Feldman as our gradient mask, which seems to be more popular and widely used in different libraries and applications. From the derived gradient matrix, the histogram of angles can be computed and passed to Eq. 3.

Now, the tuning of T_dir has come down to a peak detection method that identifies the "dominant" peaks. Therefore, to achieve any improvement on this parameter, a peak detection algorithm must be utilized. There has been a great deal of effort in the identification of peaks and valleys, especially in the domain of time series analysis and signal processing (Palshikar et al. 2009). But it is important to note that peak identification is a subjective task that is often determined by the general behaviour of the data under study.
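The gradient computation of Eqs. 4 and 5, with the Sobel-Feldman mask, can be sketched as follows. This is a pure-NumPy illustration restricted to the valid interior region; the bin count and the magnitude cutoff for ignoring flat pixels are assumptions of this sketch:

```python
import numpy as np

def sobel_gradients(img):
    """3x3 Sobel-Feldman gradients g_x, g_y (valid region only)."""
    img = np.asarray(img, dtype=float)
    # g_x: weighted right column minus weighted left column
    gx = (img[:-2, 2:] + 2 * img[1:-1, 2:] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[1:-1, :-2] - img[2:, :-2])
    # g_y: weighted bottom row minus weighted top row
    gy = (img[2:, :-2] + 2 * img[2:, 1:-1] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[:-2, 1:-1] - img[:-2, 2:])
    return gx, gy

def angle_histogram(img, n_bins=16, r_min=1e-6):
    """Normalized histogram of gradient angles phi (Eq. 5), keeping only
    pixels whose gradient magnitude r exceeds r_min."""
    gx, gy = sobel_gradients(img)
    r = np.hypot(gx, gy)
    phi = np.arctan2(gy, gx)[r > r_min]
    hist, _ = np.histogram(phi, bins=n_bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)
```

For a strongly oriented texture, the histogram mass concentrates in a few bins; it is these dominant peaks that the next step has to identify.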
Since peak detection tends to be a domain-specific task, where each domain has different criteria for the definition of peaks, it is logical to design a peak detection method that is compatible with the type of data we have, i.e., the distribution of the gradient angles of the AIA images. The method that we have chosen to utilize is explained in greater detail in (Ahmadzadeh et al. 2017). In the next section, we briefly review this approach.

4.3.1. Peak Detection

In general, the peak identification task is to determine the domains, d_i, within which the local maxima of the data sequence C = {c_1, c_2, ..., c_n} are located. In other words, the goal is to identify the d_i's such that ∃ c_i ∈ d_i, ∀ c ∈ d_i : c_i ≥ c. We build our algorithm on the basis of the naïve assumption that it is enough for each data point to be compared only with its adjacent points in the sequence, meaning that, for a local maximum c_i, the domain would be d_i = {c_{i-1}, c_i, c_{i+1}}. If c_i satisfies the condition, we consider it a candidate peak. Then we pass the candidate peaks through a three-fold filtering process to pick only the most significant ones. At each step, we check one of the user-defined criteria, namely the threshold, t, the minimum distance, d, and the maximum number of peaks, n. First, we remove all candidate peaks that lie below the threshold t. Peaks that are too close to a dominant one are removed in the next step: starting from the identified peaks with greater values, we simply remove their neighbors within a radius of d. And finally, just to provide a control tool for cases where a certain count of peaks is of interest, we keep the top n peaks and drop the rest. The proposed algorithm, in spite of its simplicity, provides a flexible tool to determine the significance of the dominant peaks in a data-driven fashion.
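The three-fold filtering just described can be sketched in plain Python (the function name and the value-descending tie-breaking order are our assumptions):

```python
def detect_peaks(seq, t, d, n):
    """Local maxima above threshold t, with neighbors within distance d
    of a stronger peak suppressed, keeping at most the n largest."""
    # candidate peaks: points >= both adjacent points
    cand = [i for i in range(1, len(seq) - 1)
            if seq[i] >= seq[i - 1] and seq[i] >= seq[i + 1]]
    # filter 1: remove candidates below the threshold t
    cand = [i for i in cand if seq[i] >= t]
    # filter 2: minimum distance d, processing stronger peaks first
    cand.sort(key=lambda i: seq[i], reverse=True)
    kept = []
    for i in cand:
        if all(abs(i - j) > d for j in kept):
            kept.append(i)
    # filter 3: keep at most the top n peaks
    return sorted(kept[:n])
```

Applied to a gradient-angle histogram, the returned indices are the "dominant" peaks whose count and positions feed the directionality computation.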
Using this algorithm, the tuning of this parameter is bound to the three above-mentioned variables of the peak detection method.

4.4. Summary of Settings

In summary, for each image parameter we identified the variables, and their domains, that play a role in tuning that parameter. We use these variables to find the best settings for the image parameters, in order to obtain the highest accuracy in the prediction of solar events. The variables of interest for each of the four image parameters are summarized below:

1. Uniformity: the number of bins, n,
2. Entropy: the number of bins, n,
3. Fractal dimension: the Gaussian smoothing parameter used in the Canny edge detector, σ,
4. Tamura directionality: the threshold, t, the minimum distance, d, and the maximum number of peaks, n, used in our peak detection method.

5. EXPERIMENTAL ANALYSIS

In this section, we discuss the tuning process of the image parameters listed in Table 1. We start by explaining our methodology as the general approach towards tuning the parameters, and then we elaborate on the details of the task for each of the four image parameters separately. Finally, we report the performance of each of the parameters in the classification of active region, coronal hole, and quiet sun event instances.

5.1. Methodology

Among the ten image parameters, the descriptive statistics (i.e., μ, σ, μ_3, μ_4) depend only on the intensity values of the pixels. On the basis of these statistics, relative smoothness and Tamura contrast can then be calculated. None of these six parameters have any constraints, and thus they are not tunable. For the remaining four parameters, we run a univariate parameter tuning process on the constraints that we identified in Section 4. For each parameter, we first find the set of n key constraints (or variables), and identify an appropriate numeric domain, d_i, for each constraint i ∈ {1, 2, ..., n}.
As a result, we will have a feature space of dimension |d_1| × |d_2| × ··· × |d_n| for that particular image parameter, where |d_i| is the cardinality of the domain set d_i. In addition, to describe a particular event, a region of interest must be processed that spans a variable number of grid cells. This presents the problem of comparing variable-sized regions of interest in order to find the optimal setting for the various parameter variables. For instance, if a region spans k grid cells, it will be represented by a vector of length k for each image parameter. So, in order to compare the variable-sized regions of interest that produce different-length vectors, we use a seven-number statistical summary on the resultant vectors. This process maps each variable-sized parameter vector computed on a region to a vector of seven values, with a consistent length. These vectors are computed independently for each of the 9 ultraviolet (UV) and extreme ultraviolet (EUV) wavelength channels from the AIA that we include in our investigations. Since these channels produce significantly different images of the Sun, we expect that each channel will require individual tuning of the parameter calculation variables in order to take such differences into consideration and produce the best results for each wavelength. Clearly, even for a very small domain for the constraints of any one parameter, a high-dimensional space will be generated by this statistical summary method; therefore, dimensionality reduction is necessary to minimize the effect of the well-known curse of dimensionality. To this end, we use the F-test statistic to rank each of the settings and then select the best ones per wavelength.
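As an illustration of how variable-sized regions reduce to fixed-length vectors, the sketch below computes a seven-number summary. The exact seven statistics are not enumerated here, so the particular choice of the five-number summary plus mean and standard deviation is our assumption.

```python
import numpy as np

def seven_number_summary(values):
    """Map a variable-length parameter vector (one value per grid cell
    in a region) to a fixed-length vector of seven summary statistics.
    The exact seven statistics are an assumption: the five-number
    summary (min, Q1, median, Q3, max) plus mean and standard deviation."""
    v = np.asarray(values, dtype=float)
    return np.array([v.min(),
                     np.percentile(v, 25),
                     np.median(v),
                     np.percentile(v, 75),
                     v.max(),
                     v.mean(),
                     v.std()])

# Two regions spanning different numbers of grid cells map to
# comparable 7-dimensional vectors.
small_region = seven_number_summary([0.2, 0.4, 0.9])
large_region = seven_number_summary(np.random.rand(120))
```

Whatever the actual seven statistics are, the key property is the one used above: regions of any size become directly comparable points in a fixed 7-dimensional space per parameter and channel.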
We use only the best settings to produce our final feature space, which is then utilized to provide a comparison of the three different input image types through a supervised classification of solar events. The ranking process in the F-test relies on grouping the data and measuring the ratio of between-group variability to within-group variability. Our methodology can be summarized in the following five steps:

1. Determining the dimension of the feature space (i.e., identifying the constraints and their domains),
2. Building the feature space for the period of one month (i.e., January 2012),
3. Reducing the dimensionality of the feature space using the F-test (i.e., finding the best settings per wavelength),
4. Building the (reduced) feature space for the period of one year (i.e., 2012),
5. Measuring the quality of the parameter using supervised learning.

In the following sections, after we discuss the dataset we used for our experiments, we explain the specific details of our methodology for each parameter.

5.2. Dataset for Supervised Learning

For the learning and classification phase, we employed the same data-collection methodology that was used by Schuh et al. (2017) to collect one year's worth of AIA images over the entire 2012 calendar year, along with the spatiotemporal data related to the solar events reported in this period. Here, we only briefly explain the data acquisition process and refer the interested reader to that article, where the entire process is explained in great detail. We target two solar event types, namely active region (AR) and coronal hole (CH), which are of particular interest to heliophysicists and which also have similar reporting characteristics that make region identification easier.
As our ground truth, we rely on the AR and CH catalogs of the HEK (Heliophysics Events Knowledgebase), which are detected by SPoCA (Spatial Possibilistic Clustering Algorithm) (Hurlburt et al. 2010). In the year 2012, HEK reported 13,518 AR and 10,780 CH event instances, at approximately a four-hour cadence. Since there are more AR instances, we first collect all of those instances and then look for CH instances within a time window of ±60 minutes from each report of an AR instance. Those AR instances that could not be paired with a temporally close CH instance are dropped. The report of each event contains both temporal and spatial information. We use the time stamps of the reports to retrieve the corresponding AIA images (in JP2 and FITS formats). The spatial data of each instance consist of a center point for the reported event, its bounding box, and a polygonal outline. We use the bounding boxes to extract the image parameters on the region corresponding to each event instance in our training and test phases. With such constraints, we managed to retrieve 2,116 unique pairs of AR and CH instances. Since our supervised learning model requires a control class, an event type that points to a region of the solar disk with no report of any other solar events, an artificial event called quiet sun (QS) is introduced. To collect a set of such instances temporally linked to our AR-CH collection, for each report of an AR event, the bounding box of that event is used to randomly search for regions that have no intersection with any reports of AR or CH events.

5.3. Determining the Feature Space

Generally, in the machine learning discipline, a feature is a measurable property of a data point being observed. For instance, for AIA images as the data points in our study, the entropy of the pixel intensities of an image is a feature derived from that image.
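As a sketch of such a derived feature, the following computes the histogram entropy of a patch's pixel intensities. This is our own minimal illustration; the number of bins is the tunable variable examined in Section 5.3.1, and the intensity range shown (0 to 255) corresponds to JP2 images.

```python
import numpy as np

def intensity_entropy(image, n_bins=64, max_val=255):
    """Shannon entropy of the pixel-intensity histogram of an image
    patch. n_bins is the tunable variable; max_val is the format's
    intensity ceiling (255 for JP2, 16383 for L1.5 FITS)."""
    hist, _ = np.histogram(image, bins=n_bins, range=(0, max_val))
    p = hist / hist.sum()
    p = p[p > 0]                      # empty bins contribute 0 (0·log 0 = 0)
    return float(-np.sum(p * np.log2(p)))

flat = np.full((64, 64), 128)          # constant patch: zero entropy
noisy = np.random.randint(0, 256, (64, 64))  # varied patch: higher entropy
```

A flat, featureless patch yields zero entropy, while a patch with widely spread intensities yields a high value, which is what makes entropy a useful discriminator between quiet and active regions.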
Given d different features, a feature space is a d-dimensional space where each of its dimensions corresponds to one of the features. Here, we are trying to tune our image parameters one by one, and we may have one or more variables for each image parameter. So, instead of having multiple features, we are dealing with multiple variations of a single feature. In other words, we derive multiple features from one single parameter and consider them as different features. Therefore, the feature space defined by an image parameter with one variable that takes |d| different values is a d-dimensional space. Similarly, for an image parameter with two variables, a (|d_1| × |d_2|)-dimensional space will be generated, where |d_i| is the cardinality of the domain set for the i-th variable.

5.3.1. Feature Space for Entropy and Uniformity

The admissible feature space suggested by the entropy or uniformity parameter is a d-dimensional space, where d is the cardinality of the candidate set for the number of bins. The evaluation of both entropy and uniformity is therefore defined as a search over a uniformly distributed number of bins, to find the best performing set for our classification task. For the original images in both JP2 and FITS formats, the pixel intensities vary within a fixed range, and therefore the general form of the candidate set can be formulated as follows:

{ k · ⌊(max − min)/l⌋ ; l ∈ ℕ, k ∈ {1, 2, 3, ..., l} }

where l sets the number of candidates and k is a scalar index. For JP2 images (min = 0, max = 255), our visual experiments show that l = 20, letting the number of bins be chosen from the set N_JP2 = {12, 24, 36, ..., 255}, gives us a comprehensive enough candidate set for creating the feature space. Using such a set, 21 different entropy (and similarly uniformity) parameters will be generated, with bin widths ranging from 1 to 21 units.
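The JP2 candidate set can be reproduced from the formula above. One caveat: the formula alone yields 20 multiples of 12 (12 through 240), so the appending of the range maximum itself as the 21st element (255 for JP2) is our assumption about how the listed 21-element set was formed.

```python
def candidate_bin_counts(min_val, max_val, l):
    """Candidate numbers of bins per the formula above:
    {k * floor((max - min) / l) for k = 1..l}. Appending the range
    maximum itself, yielding the 21st element (e.g. 255 for JP2),
    is our assumption about how the listed sets were formed."""
    step = (max_val - min_val) // l
    counts = [k * step for k in range(1, l + 1)]
    if counts[-1] != max_val:
        counts.append(max_val)
    return counts

jp2_counts = candidate_bin_counts(0, 255, 20)  # → [12, 24, 36, ..., 240, 255]
```

The same sketch does not reproduce the FITS set {780, 1560, ..., 16383} with l = 20 (⌊16383/20⌋ = 819, not 780), which suggests the exact construction may differ slightly per format; we show only the JP2 case.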
Similarly, for L1.5 FITS images (min = 0, max = 16383), the number of bins will be chosen from the candidate set N_FITS = {780, 1560, 2340, ..., 16383}.

For the clipped FITS images, however, since the max values differ from one wavelength to another, the candidate set should also adapt to the corresponding range. As the new maxima are much smaller than the global maximum, due to the transformation of the pixel values (discussed in Section 3.3.2), the above model results in bagging most of the pixel intensities into one single bin and leaving the other bins empty. To avoid such an overly smoothed histogram, in addition to substituting the after-clipping maxima for the global maximum, we downsize the bins by a factor of 10. This is meaningful because for the clipped images the pixel intensities are real numbers, as opposed to the integer intensities in the L1.5 FITS images. For example, for AIA images from the 94-Å channel, since the after-clipping range of the pixel intensities is [0, 44], the candidate set for the number of bins would be {20, 41, 62, ..., 440}, where in the most extreme case the bin size is as small as one tenth of a unit (i.e., 440 bins for the interval 0 to 44). In general, regardless of the wavelength, |N_JP2| = |N_FITS| = |N_cFITS| = 21.

5.3.2. Feature Space for Fractal Dimension

Our experiments in Section 4.2 conclude that the feature space formed by this image parameter is determined only by the domain of the variable σ in the Canny edge detection method. They also show that for σ greater than 5 (when lt = 0.02 and ht = 0.08) the results are very similar to one another and they all maintain only the very strong edges. Observing the amount of change in the output as σ increases suggests that the candidate set S = {0.0, 0.5, 1.0, ..., 5.0} generates an admissible space.

Figure 9. An AIA image in JP2 format from the 171-Å channel, and the heat-maps of Tamura directionality with different values for the variable d, where t = 90%: (a) JP2, (b) d = 1, (c) d = 7, (d) d = 20.

5.3.3. Feature Space for Tamura Directionality

As our analysis in Section 4.3 shows, the variables in our peak detection method, i.e., t and d, determine the feature space for Tamura directionality. As for the threshold on the frequency domain of the peak detection method, we consider the first, second, and third quartiles of the frequency, below which peaks are ignored, as our candidates. We also add the 90-th percentile to allow observing the results for the cases in which only the significantly dominant peaks are to be taken into account. The domain for this variable is therefore the set T = {0.25, 0.50, 0.75, 0.90}. To determine the domain for d, the minimum distance between the peaks, we should take a look at the histogram of angles. With n bins, such a histogram can be generated as follows:

h_D = { N_θ(k) / Σ_{i=0}^{n−1} N_θ(i) ; 0 ≤ k ≤ 2n − 1 }    (6)

where N_θ(x) is the frequency of the angles within the interval [kπ/(2n), (k+1)π/(2n)). Since Tamura directionality targets not the angle but the direction of the lines, the resultant histogram is symmetric around θ = 0°. To avoid redundant computation, we consider only the angles within the interval [0°, 180°). Setting n to 90 gives us a histogram with breaks at 0°, 2°, 4°, ..., 180°. For this domain of angles, the set D = {1, 3, 5, ..., 29} is an admissible domain for the minimum distance between two peaks. Note that these values indicate the minimum distance (in number of bins) that a peak must have from an already identified peak in order to be considered a dominant peak.
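Equation (6) amounts to a normalized histogram of gradient directions folded into [0°, 180°). The sketch below is our own illustration, operating on angles in degrees rather than on the radian bin edges of the formula, but it realizes the same fold-and-normalize construction.

```python
import numpy as np

def direction_histogram(angles_deg, n=90):
    """Normalized histogram h_D of gradient directions, in the spirit
    of Eq. (6): angles are folded into [0, 180) so that opposite
    angles map to the same direction, then binned and normalized.
    n = 90 gives breaks at 0°, 2°, 4°, ..., 180°."""
    folded = np.mod(angles_deg, 180.0)          # direction, not signed angle
    counts, _ = np.histogram(folded, bins=n, range=(0.0, 180.0))
    return counts / counts.sum()

# 45° and 225° describe the same direction, so they fall in the same bin.
h = direction_histogram([0, 45, 90, 225, 135], n=90)
```

Tamura directionality then feeds peaks of this histogram, found with the peak detection method above, into its final score; the minimum distance d is counted in these 2° bins.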
In Fig. 9, the heat-maps of Tamura directionality for three different settings of d are shown.

5.4. Building the Feature Space

For each of the four image parameters, we compute its feature space by calculating all different variations of that parameter on one month's worth of 4K AIA images (January 2012). This is done on JP2, FITS, and clipped FITS images separately.

5.5. Dimensionality Reduction

To reduce the dimensionality of the computed feature spaces, the F-test in one-way analysis of variance (ANOVA) is used to pick the feature (per wavelength) which has the highest rank in separating the three solar event types: active region, coronal hole, and quiet sun. The score of each feature is computed as the ratio of between-group variability to within-group variability, where all the instances of each solar event type form a single group. The ranking procedure is as follows: for each feature, or setting, all the instances of the three event types reported by HEK are collected. Using random undersampling, we make sure that the number of instances in all three categories is the same, to remedy the class-imbalance problem. After computing the features of interest on the image cells spanning the bounding boxes of events, the results are summarized using the seven-number summary. With a ten-fold sampling, we use the F-test to rank the settings. We then aggregate the scores per setting on its seven-number summary, and finally sort the settings by their scores and report the highest per wavelength. As an example, the parameter Tamura directionality on JP2 AIA images in the 94-Å wavelength channel, with t = 25 and d = 1, was ranked the best compared to any other variation of that image parameter. Table 4 summarizes the best setting per wavelength channel, for each of the three image formats.
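The F-test score used for ranking, i.e., the ratio of between-group to within-group variability, can be sketched as the classical one-way ANOVA F statistic. The example values for the three event-type groups below are synthetic, for illustration only.

```python
import numpy as np

def f_score(groups):
    """One-way ANOVA F statistic: ratio of between-group variability
    to within-group variability. Here each group would hold one
    summary statistic of a candidate setting, computed over all
    instances of one event type (AR, CH, or QS)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = np.mean(np.concatenate(groups))
    # Between-group mean square: sum of squares over (k - 1) d.o.f.
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ms_between = ss_between / (k - 1)
    # Within-group mean square: sum of squares over (n - k) d.o.f.
    ss_within = sum(((g - np.mean(g)) ** 2).sum() for g in groups)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Well-separated groups score far higher than fully overlapping ones.
ar = np.array([0.9, 1.0, 1.1]); ch = np.array([2.9, 3.0, 3.1])
qs = np.array([4.9, 5.0, 5.1])
mixed = [np.array([1.0, 3.0, 5.0])] * 3
```

A setting whose feature values cluster tightly within each event type but differ across types receives a large F score, which is exactly why the highest-ranked setting per wavelength is kept.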
To help understand how the best setting for an image parameter provides a better distinction between the instances of different event types, an example is illustrated in Fig. 10. In this visualization, the image parameter is Tamura directionality, and the chosen statistic is Q1 (first quartile). The figure shows the difference between the distributions of Q1 of this parameter with the best setting, as opposed to an arbitrary setting, on the three event types. Note how in plot A, where the best setting is used, the three distributions are much more distinguishable than in plot B, where an arbitrary setting is used.

Figure 10. This plot illustrates the difference between the distributions of a statistic of the best setting for an image parameter (A) and an arbitrary setting (B), on one month's worth of 4K AIA images. The three colors distinguish the distributions of the different solar event types (active region, coronal hole, and quiet sun), and the dotted lines indicate the mean values of the distributions. Note how in A the three distributions are more distinguishable. In this example, the image parameter is Tamura directionality, the wavelength is 94-Å, and the statistic is the first quartile.

After this step, for each of the four image parameters, the dimensionality of the defined space shrinks significantly, from several thousands down to 63 (for 9 wavelength channels and 7 summary statistics).

5.6. Building the Reduced Feature Space

After reducing the dimensionality, the best setting for each image parameter is used to form the reduced feature space. This new feature space is then generated based on one year's (Jan 1 through Dec 31, 2012) worth of AIA images, for JP2, L1.5 FITS, and clipped FITS images, with the cadence of 6 minutes.

5.7.
Classification

To measure the performance of the four image parameters after finding the best setting for each of them, we employ two classifiers, namely Naïve Bayes and Random Forest (we use the Statistical Machine Intelligence and Learning Engine (smile) Java library: http://haifengl.github.io/smile/). The Naïve Bayes classifier (Maron 1961) is a simple statistical model that learns by applying Bayes' theorem, with a strong independence assumption, to the labeled data, and classifies based on the maximum a posteriori rule. In the context of our data points, for an event instance e_t reported at time t, which can be of type AR, CH, or QS, it calculates the feature vector v_t = {x_1, ..., x_n}, where n is the dimension of the defined feature space, and then classifies e_t's event type, denoted by ŷ_t, as follows:

ŷ_t = argmax_{C_k ∈ {AR, CH, QS}} p(C_k) ∏_{i=1}^{n} p(x_i | C_k)    (7)

Since the Naïve Bayes classifier relies only on the probability of the occurrences of the events, the model is expected to perform poorly in classification of the less trivial cases. For the sake of completeness, we also employ the Random Forest classifier (Ho 1995) for evaluation of the image parameters. This is an ensemble learning model that builds decision trees on samples of the data (a process called bootstrap aggregating) and assigns the class label by taking the majority vote of the trees classifying each data point. For our data, we generate a forest of 60 different trees, each of which classifies the event types of the instances; in the end, the ensemble model makes the final decision by taking the majority vote of the trees.

For both classification models, we perform a k-fold cross-validation by sampling the events' instances on all combinations of any group of 4 months in the year 2012, resulting in C(12, 4) = 495 different trials. This allows the test sets to be unbiased with respect to potential patterns in the occurrence of solar events. Using repetitive random undersampling, we avoid the negative effect of imbalanced datasets as well.

Figure 11. The classification results on the three event types (active region, coronal hole, and quiet sun) using Naïve Bayes (first row) and Random Forest (second row) classifiers are illustrated here, separately for each event type, using the f1-score measure. Each reported measure is averaged over 495 trials of a 10-fold cross-validation sampling. Each trial was executed on a random sample of events' instances from 13,518 AR, 10,780 CH, and 13,518 QS event instances, within the period of 01-01-2012 through 31-12-2012. For each bar, the number at the bottom represents the f1-score value and the error interval shows the standard deviation of the f1-score. The image parameters are entropy (EN), uniformity (UN), fractal dimension (FD), and Tamura directionality (TD).

For reporting the performance of these models, we choose the f1-score measure (also known as F-score or F-measure), which is the harmonic average of precision and recall. Given that the precision p is the number of correct positive classifications divided by the total number of (correct or incorrect) positive results returned by the model, and the recall r is the number of correct positive classifications divided by the total number of instances of the positive class, the f1-score can be formulated as follows:

f1-score = 2 · (p × r)/(p + r).
(8)

Since we have three classes (AR, CH, and QS) for our classification models, the f1-score must be reported for each class separately. To measure p and r for our ternary model, we use the one-against-all strategy, which aims to classify objects of one type against the other two; the one-against-one strategy would instead consider all pairs of classes and report the classification performance separately, which is unnecessary for our task. Furthermore, it is important to note that the undersampling step employed in the k-fold cross-validation provides balanced data for the models. Therefore, our choice of performance measure does not need to be class-imbalance resistant (e.g., True Skill Score).

The results of our experiments, using both the Naïve Bayes and Random Forest models, are illustrated in Fig. 11. The key points about the results are enumerated below:

• The performance of the two models is based on single image parameters and not their combinations. Random Forest, as we predicted, performs significantly better. Using this model, one can observe that each of the four image parameters can individually classify active region instances fairly well (f1-score > 0.8), regardless of the image format. For the coronal hole instances, the results are only slightly lower but consistent (≈ 0.7 when JP2 images are used). The fact that such high confidence levels are reached using a set of very basic image parameters that are not domain specific (i.e., not tailored for classification of phenomena such as solar events) should stress the importance of our choices.
• Note that the relatively poor performance of both models in classification of QS is not a large concern, since QS is a synthesized event, and some other event types that are reported to HEK but not used in this study could be adding noise to the instances labeled as QS, resulting in lower purity of the class labels. However, the results are still above those expected if the samples were simply assigned a random label, and therefore indicate the possibility that these parameters can transfer to other event-type classification tasks.

• Another very important aspect of the results is the comparison of the classification on different image formats, as the plots depict. For the Random Forest classifier, in almost all cases, the JP2 format is shown to be the better input for the model, compared to both FITS and clipped FITS. Even for the Naïve Bayes classifier, which did not perform as well as Random Forest, there is no consistent superiority when FITS or clipped FITS images are used compared to the JP2 format. This is despite the fact that the FITS format theoretically contains more information than the compressed JP2, and therefore produces much larger files. In fact, an image in FITS format is 5 to 14 times larger than its JP2 version, depending on the wavelength channel used. With such an understanding, we can now make our entire image repository ≈ 10 times smaller in size, with even some improvement in classification of solar events.

As one of our main contributions was to provide a dataset of tuned image parameters, we compare the classification of the solar events before and after the tuning steps on the image parameters. As shown in Fig. 12, our tuning results in significant improvement for all four image parameters across the event types. Note that the performance on the image parameters without tuning is only slightly above the random guess, which is 0.33.
This is simply because the previous computation of the image parameters lacked the thorough analysis of the individual parameters and the tailored tuning steps. Of course, the scope of this study is limited to tuning the image parameters, and the results in Fig. 11 and 12 reflect only the impact of the obtained image parameters, while better models (with higher performance or more robustness) can potentially be trained by exploring different classifiers, such as SVMs or even deep neural networks, and tuning their hyper-parameters in a data-driven fashion.

Figure 12. The illustration compares the performance of the Random Forest classifier in classification of the three solar event types (active region, coronal hole, and quiet sun) using each of the four image parameters, before and after tuning. The image parameters are entropy (EN), uniformity (UN), fractal dimension (FD), and Tamura directionality (TD).

6. THE RESULTANT DATASET

Having demonstrated the effectiveness of utilizing tuned parameter settings for JP2 format AIA images, we then set out to produce a dataset (≈ 1 TiB/year) that is easily accessible for researchers wishing to utilize this data. The dataset we have created contains the ten image parameters listed in Table 1, which are processed from images captured by the SDO spacecraft and are extracted from the AIA images at a six-minute cadence for each wavelength we process. As previously mentioned, the original images are high-resolution (4096 × 4096 pixel), full-disk snapshots of the Sun, taken
in ten extreme ultraviolet channels (the nine channels that we utilize in this work are 94 Å, 131 Å, 171 Å, 193 Å, 211 Å, 304 Å, 335 Å, 1600 Å, and 1700 Å) (Lemen et al. 2012). The original high-resolution images are accessible upon request from the Joint Science Operations Center, but our dataset is processed from the JP2 compressed images available through the random-access API at the Helioviewer repository (https://api.helioviewer.org).

We have created an API (http://dmlab.cs.gsu.edu/dmlabapi/) that allows random access to the produced image parameter data. The processed dataset starts with observations from January 1, 2011 00:00:00 UTC, and our intent is to keep the dataset updated with current observations for as long as the source of our data continues to provide new observations. The methods used for calculating the parameter values are released as part of our open source library DMLabLib (https://bitbucket.org/gsudmlab/dmlablib). The settings for each of the parameter calculation methods that require some sort of setting value are listed in Table 4 of Appendix A. Note that each of the nine waveband channels that we process has its own set of settings for each of the parameter calculation methods.

One already established use case for this dataset is tracking solar events that have been reported to the HEK (Kempton et al. 2018; Kempton & Angryk 2015), where the parameters are used to perform visual comparisons of detections forming the different possible paths a tracked event could take. Another is the use of the parameters to perform whole-image comparisons for similarity search in the context of content-based image retrieval (Kempton et al. 2016b). Similarly, the parameters have also been used to perform region comparisons for similarity search in the context of region-based content-based image retrieval (Schuh et al. 2017).
These are just a few of the possible use cases that we know have utilized a smaller, un-optimized previous version of this dataset. Appendix A provides some additional analysis of the dataset produced by this work.

7. CONCLUSION AND FUTURE WORK

We presented background information about the AIA images produced by the SDO mission and compared the FITS and JP2 image formats and the distributions of the pixel intensities in each of them. We also reviewed different aspects of each of the ten image parameters that we selected to extract the important features of those images, and then explained how we designed several different experiments to find the best settings for each of the features on the different wavelength channels and the different image formats. After we obtained the best settings for each of the image parameters, we processed one year's worth of data and extracted those features from the images queried with the cadence of 4 hours. Finally, we presented our public dataset as an API, running several statistical analyses to illustrate a more accurate picture of the ready-to-use dataset.

We hope that our public dataset interests more researchers of different backgrounds and attracts more interdisciplinary studies to solar images. While we aim to keep our API data up-to-date with the stream of data coming from the SDO, we would like to expand it by adding more interesting image parameters, specifically computed for different solar events, which could lead to a better understanding of solar phenomena and higher classification accuracy.

This work was supported in part by two NASA Grant Awards [No. NNX11AM13A and No. NNX15AF39G], and one NSF Grant Award [No. AC1443061].
The NSF Grant Award has been supported by funding from the Division of Advanced Cyberinfrastructure within the Directorate for Computer and Information Science and Engineering, the Division of Astronomical Sciences within the Directorate for Mathematical and Physical Sciences, and the Division of Atmospheric and Geospace Sciences within the Directorate for Geosciences. Also, we would like to mention that all images used in this work are courtesy of NASA/SDO and the AIA, EVE, and HMI science teams.

APPENDIX

A. STATISTICAL ANALYSIS OF DATASET

In this section, we present more statistical insight about the prepared dataset through a number of figures. Fig. 13 illustrates the changes in the distribution of pixel intensities of FITS images for the month of September 2012, with the cadence of 2 hours. We use this to support our argument for the cut-off point used in clipping the FITS files in every wavelength channel (see Section 3.2.2). Observing the changes of the 99.5-th percentile of the pixel intensities in FITS images, and knowing that several pixels with the maximum intensity value (i.e., 16383) are present within this period, tells us that clipping at the highest point reached by this percentile, while reducing the range of the intensities significantly, only affects 0.5% of the pixels. As an example, for images in 94-Å (see the first plot at the top of this figure), the highest value reached by the 99.5-th percentile of the pixel values is equal to 44, while pixels as bright as 16383 are present. Among the five different percentiles, the one with the minimum effect on the images, i.e., the 99.5-th, is chosen for clipping the FITS images to generate the new set of images that we referred to as clipped FITS. The few sudden changes of the pixel intensities in Fig. 13, as we investigated, are mainly due to the several C- and M-class flares reported in this period.
In some cases, the magnetically charged particles reaching the CCD detectors of the AIA instrument also result in overexposed images, hence the spikes. To present a big picture of the flow of data in the dataset, we show the mean value of each of the ten image parameters after they are extracted from the AIA images, for the entire month of January 2012 (Fig. 14). The ten image parameters for this plot are computed on the entire full-disk images, and the mean statistic is then extracted from the resultant matrix. To present the continuity of the collected and computed data, we show the time differences between the image data points of our dataset for the entire calendar year of 2012, with the cadence of 6 minutes, in Fig. 15, and for one month, across the nine wavelength bands, in Fig. 16.

The small periods where the values go to zero in Fig. 14 are artifacts of missing input data and/or corrupted images that are uniformly black. Similarly, the periods where the time between reports peaks for some stretch are another indication of missing input data. This can be caused by any of numerous possible reasons that could cause a step in the processing pipeline to fail to receive an image from the previous step in the pipeline. These range from the satellite not transmitting the data in the first place, to an error at any one of the processing steps prior to our processing of the JP2 image from Helioviewer. The missing data can also be caused, as found in Schuh et al. (2015), by the moon or the earth itself occluding the view of the sun from the satellite on an almost daily basis, as seen in March 2012 in Fig. 15. In all, this does not represent a significant portion of the dataset, given that the data corresponding to a few months in 2012 are missing the largest portion compared to other years. At the end, the best settings derived and used to generate this dataset are presented in Table 4.
The numeric values mentioned in this table are mostly useful for reproducibility of the dataset; those interested in the creation steps of the dataset can reproduce it thanks to our open-source library, DMLabLib (https://bitbucket.org/gsudmlab/dmlablib).

B. IMPACT OF NON-ZERO QUALITY OBSERVATIONS

In this section, we address the specific concern regarding the impact of the AIA instrument degradation, as well as the usage of "low quality" images, on our dataset. By "low quality" we mean images whose QUALITY flag in

Figure 13. Different percentiles (80-th, 90-th, 95-th, 99-th, and 99.5-th) of pixel intensities for ≈3240 AIA FITS images (i.e., approximately 360 images per wavelength channel). Each of the nine plots corresponds to one wavelength channel of the AIA instrument, specified in cyan, on the left. Each curve tracks the changes of the pixel intensity distribution of images captured every 2 hours, within the period of September 2012.

Figure 14. (Panels: Fractal Dimension, Standard Deviation, Entropy, Mean, Skewness, Kurtosis, Uniformity, Relative Smoothness, Tamura Contrast, Tamura Directionality.)
Mean of the ten image parameters extracted from images queried for a period of one month (2012-01). With the cadence of 6 minutes, the plot represents 7440 AIA images from the wavelength channel 171 Å.

Figure 15. The time differences (in minutes) between image parameter files for AIA images, from the wavelength channel 171 Å, over the entire period of the year 2012.
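The gap series shown in Figs. 15 and 16 amounts to differencing consecutive observation timestamps; values well above the 6-minute cadence flag missing data. A minimal sketch (the function name and synthetic timestamps are ours):

```python
from datetime import datetime, timedelta

def report_gaps(timestamps, cadence_minutes: int = 6):
    """Return the time difference (in minutes) between consecutive
    observations; gaps much larger than the cadence indicate missing
    or corrupted input images."""
    ts = sorted(timestamps)
    return [(b - a).total_seconds() / 60.0 for a, b in zip(ts, ts[1:])]

# A 6-minute cadence with one image dropped, producing a 12-minute gap.
start = datetime(2012, 1, 1)
times = [start + timedelta(minutes=6 * i) for i in range(10) if i != 5]
gaps = report_gaps(times)
```

Plotting such a gap series per wavelength channel reproduces the structure of Figs. 15 and 16, where flat stretches at the cadence value indicate continuous coverage and spikes indicate missing data.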
Figure 16. The time differences (in minutes) between image parameter files for AIA images, from the 9 different wavelength channels, over the month of January 2012.

Table 4. The best settings per wavelength, for the four image parameters, across three image formats. In this table, n indicates the number of bins used to compute entropy or uniformity, t and d are the threshold and peak-to-peak distance, respectively, used to measure directionality, and the variable σ stands for the Gaussian smoothing parameter required in computing fractal dimension. For more details about these variables, see Section 4.4.
                 Uniformity   Fractal Dim.   Tamura Directionality   Entropy
  Wavelength (Å)     n             σ             t         d            n

  JP2
  94                 12           2.0           25        1            12
  131                36           1.0           25        1            60
  171                60           4.5           75        1            12
  193                97           1.0           25        1            24
  211                84           1.5           25        1            12
  304                36           3.5           75        1            12
  335                97           2.0           25        1            12
  1600               109          2.5           90        1            12
  1700               48           4.0           90        3            12

  Clipped FITS
  94                 62           4.5           7         5            104
  131                1230         4.0           7         4            175
  171                3717         4.5           9         3            1239
  193                1889         5.0           6         2            1889
  211                796          2.0           9         4            796
  304                615          5.0           9         4            615
  335                1888         4.0           7         4            435
  1600               5090         4.5           7         4            2666
  1700               1970         3.0           4         3            1970

  L1.5 FITS
  94                 12           4.0           25        21           3900
  131                36           5.0           90        1            780
  171                60           0.0           25        23           780
  193                97           1.0           75        1            780
  211                84           1.0           75        1            780
  304                36           5.0           75        1            780
  335                97           4.0           25        21           2340
  1600               109          5.0           90        3            780
  1700               48           3.5           25        23           780

their header is set to a non-zero value (Nightingale 2011). This value is an integer whose 32-bit binary representation describes 32 different issues, such as missing flat-field data, missing orbit data, and the like.

B.1. Impact of CCD Degradation

The CCDs (charge-coupled devices) of the AIA instrument, like any electronic devices, are subject to degradation. The impact of CCD degradation was known prior to the launch of SDO (Boerner et al. 2011) and has been studied ever since (e.g., Fontenla et al. 2016). The effect of instrument degradation is a secular decrease over time in the data counts of the FITS files, which results in a gradual decrease in the pixel intensities of the AIA images. This trend, although very subtle and only visible when the average data counts of FITS files are monitored over the course of several years, can potentially impact many pixel-based analyses of solar events (to the best of our knowledge, no study has provided sufficient evidence for such impact, and the characteristics of the tasks impacted are not clearly known).
To this end, a periodic re-calibration of the instrument was planned prior to the launch of SDO and has been, and will continue to be, carried out periodically to ensure the quality of the data. The details of this calibration process are described in Boerner et al. (2011). Our dataset is based on the level 1.5 data utilized by Helioviewer, whose gains are adjusted to use the above-mentioned calibration so that there is a consistent "zero level" in the images. In case the above procedure does not fully resolve the degradation impact, we still believe that the effect should be negligible for our dataset. This is mainly because of the different nature of our data points and the applications this dataset is meant to be used for. Specifically, the data points in our dataset are extracted image parameters, not the raw pixel values. Furthermore, in this study we were able to show that the extreme high end of the range of values in the recorded L1.5 FITS images is actually detrimental to the results of our analysis, and therefore we clip these values. The clipping was done either in our pre-processing phase, when we used the FITS files, or by Helioviewer's JP2GEN project, which provided the JP2 images for our analyses. So, the dynamic range compression in the images that is introduced by having to turn up the gain as the CCD deteriorates will most likely not have a noticeable impact, if any, on our work. Additionally, the extracted parameters used in this study are minimally affected by long-term global changes in image intensities, especially when applied to the clipped images. As an example, consider the standard deviation parameter from our dataset.
This is computed in local regions of a processed image, and the subtle changes of the overall dynamic range of the brightness of source images, caused by a drifting "zero level", will have minimal effect on the results when applied to images that are pre-processed using a clipping method to reduce the dynamic range of the intensity values. Another example would be fractal dimension, which is computed on the detected edges. As discussed in subsection 2.1.3, the edge detection is carried out based on the local gradients within images, and therefore mild long-term changes, such as the one imposed by CCD degradation, will not have a significant impact on the computed dimension, if any. Among the ten image parameters, only the mean parameter is susceptible to the degradation. The magnitude of the impact can be determined by the degree of degradation that could not be completely resolved in the AIA level 1.5 data products.

B.2. Impact of Instrument Anomalies

Based on our empirical study of hundreds of AIA images with non-zero QUALITY values (i.e., low quality images), these images fall into two main groups. One comprises the images which are visually no different from zero-QUALITY AIA images. In fact, in some cases the missing information does not affect the pixel values of the images at all. The other group, however, contains images in which the Sun's disk is rotated, shifted, or blocked due to eclipse, or in which, because of instrumental artifacts, large patches of black squares appear. These are certainly not proper inputs for any analyses. To the best of our knowledge, the frequency of the 32 different quality flags has not yet been studied on AIA images. Our brief study of several (non-consecutive) months' worth of AIA images, with the cadence of 36 seconds, shows a presence of ≈4.2% non-zero QUALITY images (both group one and group two).
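Since each QUALITY value packs 32 issue flags into one integer, checking an image reduces to testing individual bits. A minimal sketch (the helper function is ours; the authoritative bit-to-issue mapping must be taken from Nightingale 2011):

```python
def quality_issues(quality: int) -> list:
    """Return the indices of the set bits in a 32-bit AIA QUALITY value.

    Each bit position flags one distinct issue (e.g., missing flat-field
    data, missing orbit data); see Nightingale (2011) for the mapping.
    """
    return [bit for bit in range(32) if quality & (1 << bit)]

# A zero QUALITY value carries no flags; this example value carries two.
flags = quality_issues(0b10001)  # bits 0 and 4 are set
```

A user of the dataset could apply such a filter to the QUALITY values returned by the API to exclude, or keep, low quality observations as their analysis requires.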
Of course, to achieve reliable statistics on the fraction of low quality images in the entire AIA data collection, a much larger sample should be processed. Unfortunately, the lack of proper documentation on the FITS keywords and the absence of a publicly available database of the header information make it difficult to carry out a more thorough analysis on this topic. Therefore, we leave the computation of a more comprehensive statistic on the fraction of images with fundamental quality issues (i.e., the second group) to the original AIA image data providers. Since we computed the ten image parameters on all AIA images that fell into our sampling cadence, regardless of their quality flag, we added the QUALITY value of images to our database, and provided the user with the corresponding requests to retrieve the QUALITY values from the API, as well as some other basic spatial header information needed for labeling of the solar events. It is up to interested researchers to decide whether they prefer to keep the low quality images for their study. It is worth noting that, regarding the first group of images, the lack of some pieces of information may disqualify such images for some specific scientific analyses; however, we believe that machine learning models built on the extracted image parameters (i.e., our dataset) would not be affected by such unnoticeable differences. Preprocessing the raw data and achieving a cleaned dataset are indeed critical steps in any data-related analyses. This is, in fact, the premise of the current study. That said, machine learning models are designed to have a degree of resistance against noise.
As they learn the global patterns and structures of the data by fitting mathematical models against a very large number of data points, often in a high-dimensional vector space, having a few data points with some additional noise in just a few dimensions would not impact the overall performance of the models. This is our reasoning for not excluding the low-quality images. Users of the dataset can decide on this based on their understanding of the impact of low-quality images on their desired models.

C. IMPACT OF HETEROGENEOUS EXPOSURE TIME

AIA is equipped with an automatic exposure control (AEC) which adjusts the length of time the cameras' sensors are exposed to light. This adjustment takes into account the overall brightness of the Sun. During the occurrence of some solar activities, such as large flares, some regions on the Sun are significantly brighter. In such cases, a shorter exposure time can produce an image of higher quality. The exposure time used for each image is recorded in its header information. We use this information to normalize the pixel intensities of each image before we compute the image parameters.

REFERENCES

Ahmadzadeh, A., Kempton, D. J., Schuh, M. A., & Angryk, R. A. 2017, in Big Data (Big Data), 2017 IEEE International Conference on, IEEE, 2518–2526. https://doi.org/10.1109/BigData.2017.8258210
Annadhason, A. 2012, IRACST-International Journal of Computer Science and Information Technology & Security (IJCSITS), 2
Aschwanden, M. J., & Aschwanden, P. D. 2008, The Astrophysical Journal, 674, 530
Bajcsy, R. 1973, in Proceedings of the 3rd International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers Inc., 572–579
Banda, J., Angryk, R., & Martens, P. 2013, Solar Physics, 288, 435
Banda, J. M., & Angryk, R. A. 2009, in Fuzzy Systems, 2009. FUZZ-IEEE 2009. IEEE International Conference on, IEEE, 2019–2024
Banda, J. M., & Angryk, R. A.
2010a, in Digital Image Computing: Techniques and Applications (DICTA), 2010 International Conference on, IEEE, 528–534
Banda, J. M., & Angryk, R. A. 2010b, in FLAIRS Conference, 380–385
Banda, J. M., Angryk, R. A., & Martens, P. C. 2011, 3669. https://doi.org/10.1109/ICIP.2011.6116515
Barbieri, A. L., De Arruda, G., Rodrigues, F. A., Bruno, O. M., & da Fontoura Costa, L. 2011, Physica A: Statistical Mechanics and its Applications, 390, 512
Boerner, P., Edwards, C., Lemen, J., et al. 2011, in The Solar Dynamics Observatory (Springer), 41–66
Canny, J. 1986, IEEE Transactions on Pattern Analysis and Machine Intelligence, 679
Council, N. R. 2008, Severe Space Weather Events–Understanding Societal and Economic Impacts: A Workshop Report (Washington, DC: The National Academies Press). https://doi.org/10.17226/12507
Davis, L. S., Johns, S. A., & Aggarwal, J. 1979, IEEE Transactions on Pattern Analysis & Machine Intelligence, 251
Flickner, M., Sawhney, H., Niblack, W., et al. 1995, Computer, 28, 23
Fontenla, J., Codrescu, M., Fedrizzi, M., et al. 2016, The Astrophysical Journal, 834, 54
Hare, J. S., Samangooei, S., & Dupplaw, D. P. 2011, in Proceedings of the 19th ACM International Conference on Multimedia, MM '11 (New York, NY, USA: ACM), 691–694. http://doi.acm.org/10.1145/2072298.2072421
Heath, M., Sarkar, S., Sanocki, T., & Bowyer, K. 1996, in Computer Vision and Pattern Recognition, 1996. Proceedings CVPR'96, 1996 IEEE Computer Society Conference on, IEEE, 143–148
Hinneburg, A., Aggarwal, C. C., & Keim, D. A. 2000, in 26th International Conference on Very Large Databases, 506–515
Ho, T. K. 1995, in Proceedings of the 3rd International Conference on Document Analysis and Recognition, Vol. 1, IEEE, 278–282
Hurlburt, N., Cheung, M., Schrijver, C., et al. 2010, in The Solar Dynamics Observatory (Springer), 67–78.
http://dx.doi.org/10.1007/978-1-4614-3673-7_5
Islam, M. M., Zhang, D., & Lu, G. 2008, in Multimedia and Expo, 2008 IEEE International Conference on, IEEE, 1521–1524
Justice, J. H. 1986, in Proceedings of the 4th Maximum Entropy Workshop, University of Calgary, 1984 (Cambridge: Cambridge University Press), edited by Justice, James H. https://doi.org/10.1049/esn.1986.0002
Kempton, D., & Angryk, R. A. 2015, Astronomy and Computing, 13, 124. http://dx.doi.org/10.1016/j.ascom.2015.10.005
Kempton, D. J., Schuh, M. A., & Angryk, R. A. 2016a, in International Conference on Artificial Intelligence and Soft Computing, Springer, 88–101. http://dx.doi.org/10.1007/978-3-319-39384-1_8
Kempton, D. J., Schuh, M. A., & Angryk, R. A. 2016b, in 2016 IEEE International Conference on Big Data (Big Data), 3168–3176. https://doi.org/10.1109/BigData.2016.7840972
Kempton, D. J., Schuh, M. A., & Angryk, R. A. 2018, The Astrophysical Journal, 869, 54. https://iopscience.iop.org/article/10.3847/1538-4357/aae9e9
Lemen, J. R., Title, A. M., Akin, D. J., et al. 2012, SoPh, 275, 17
Maini, R., & Aggarwal, H. 2009, International Journal of Image Processing (IJIP), 3, 1
Mandelbrot, B. 1967, Science, 156, 636
Maron, M. E. 1961, Journal of the ACM (JACM), 8, 404
Martens, P., Attrill, G., Davey, A., et al. 2012, Solar Physics, 275, 79. http://dx.doi.org/10.1007/s11207-010-9697-y
Nightingale, R. 2011, AIA/SDO FITS Keywords for Scientific Usage and Data Processing at Levels 0, 1.0, and 1.5, Tech. Rep., Lockheed-Martin Solar and Astrophysics Laboratory (LMSAL)
Palshikar, G., et al. 2009, in Proc. 1st Int. Conf. Advanced Data Analysis, Business Analytics and Intelligence, Vol. 122
Pesnell, W. D., Thompson, B. J., & Chamberlin, P. C. 2012, SoPh, 275, 3
Pluim, J. P., Maintz, J. A., & Viergever, M. A. 2003, IEEE Transactions on Medical Imaging, 22, 986
Prewitt, J. M.
1970, Picture Processing and Psychopictorics, 10, 15
Razlighi, Q., & Kehtarnavaz, N. 2009, in Visual Communications and Image Processing 2009, Vol. 7257, International Society for Optics and Photonics, 72571X
Revathy, K., Lekshmi, S., & Nayar, S. P. 2005, Solar Physics, 228, 43
Roberts, L. G. 1963, PhD thesis, Massachusetts Institute of Technology
Schuh, M., Angryk, R., & Martens, P. 2015, Astronomy and Computing, 13, 86. https://doi.org/10.1016/j.ascom.2015.10.004
Schuh, M. A., & Angryk, R. A. 2014, in Big Data (Big Data), 2014 IEEE International Conference on, IEEE, 53–60. https://doi.org/10.1109/BigData.2014.7004404
Schuh, M. A., Kempton, D., & Angryk, R. A. 2017, in FLAIRS Conference, 526–531
Scott, D. W. 1979, Biometrika, 66, 605
Shannon, C. E. 2001, ACM SIGMOBILE Mobile Computing and Communications Review, 5, 3
Sharifi, M., Fathy, M., & Mahmoudi, M. T. 2002, in Information Technology: Coding and Computing, 2002. Proceedings. International Conference on, IEEE, 117–120
Skilling, J. 1989, in Maximum Entropy and Bayesian Methods (Springer), 45–52
Smith, S. M., & Brady, J. M. 1997, International Journal of Computer Vision, 23, 45
Sobel, I. 1990, Machine Vision for Three-Dimensional Scenes, 376
Starck, J.-L., Murtagh, F., Querre, P., & Bonnarel, F. 2001, Astronomy & Astrophysics, 368, 730
Sturges, H. A. 1926, Journal of the American Statistical Association, 21, 65
Tamura, H., Mori, S., & Yamawaki, T. 1978, IEEE Transactions on Systems, Man, and Cybernetics, 8, 460. https://doi.org/10.1109/TSMC.1978.4309999
Trunk, G. V. 1979, IEEE Transactions on Pattern Analysis & Machine Intelligence, 306
Verbeeck, C., Delouille, V., Mampaey, B., & De Visscher, R. 2014, Astronomy & Astrophysics, 561, A29. http://dx.doi.org/10.1051/0004-6361/201321243
Verleysen, M., & François, D. 2005, in International Work-Conference on Artificial Neural Networks, Springer, 758–770
Weisstein, E. W. 2008
Wells, D. C., & Greisen, E. W.
1979, in Image Processing in Astronomy, 445
Withbroe, G. L. 2013, Living With a Star (American Geophysical Union), 45–51. http://dx.doi.org/10.1029/GM125p0045
