This paper addresses the interpretability of deep-learning-enabled image recognition in computer vision science in relation to theories in art history and cognitive psychology concerning the vision-related perceptual capabilities of humans. An examination of what is determinable about the machine-learned image in comparison with humanistic theories of visual perception, particularly art historian Erwin Panofsky's methodology for image analysis and psychologist Eleanor Rosch's theory of graded categorization according to prototypes, finds surprising similarities between the two, suggesting that researchers in the arts and the sciences have much to gain from closer collaboration. Using the examples of Google's DeepDream and the Grad-CAM (Gradient-weighted Class Activation Mapping) program from the Machine Learning and Perception Lab at Georgia Tech, this study suggests that, given the rapid development of image recognition technologies, a revival of art historical research in iconography and formalism in the age of AI is essential for shaping the future navigation and interpretation of machine-learned images.
This is particularly the case with image recognition and retrieval tools. While research in this area is often sequestered in the fields of computer science and business development in the technology sector, I aim to demonstrate how the analysis of images with deep-learning techniques can engage the humanities, complement existing sociocultural theories, and offer the possibility of new methodologies for image analysis that take cognitive psychology into consideration. Let us therefore begin with the development of DeepDream as part of the larger turn toward deep-learning techniques in computer vision science, in order to examine some of the ways in which images are processed by machines in comparison to humans. In essence, let us consider the iconology of the digital image vis-à-vis deep neural networks: what I have termed the machine-learned image.

As part of the 2014 ImageNet Large-Scale Visual Recognition Challenge, a deep convolutional neural network architecture was created by a team of computer scientists from Google, the University of North Carolina at Chapel Hill, the University of Michigan at Ann Arbor, and Magic Leap Incorporated. The goal of this competition was to improve the classification of images and the detection of their contents. The success of this research group's network rested on its computational efficiency and its enhanced ability to analyze images through multiple wide layers of coding. Applied to the interpretation of two-dimensional digital images, this branch of machine learning based on neural networks, more commonly known as deep learning, can be imagined as a complex data filtration system that processes information layer by linear- and non-linear-algorithmic layer. Like a water filtration system that distills liquid by passing to each subsequent level only what has already been filtered, a convolutional neural network has the capacity to filter information according to multiple parameters, the size of its constituent parts being just one component of the system. In both cases, there are many steps between the input and the output, which allows for increased complexity in the final analysis.
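To make the filtration analogy concrete, the following is a minimal sketch, in PyTorch, of a small convolutional network. It is an illustrative toy under assumed, arbitrary layer sizes, not the architecture entered in the 2014 challenge:

```python
import torch
import torch.nn as nn

# A toy convolutional network illustrating the layered "filtration" of an
# image: each stage passes forward only what the previous stage has already
# filtered. Layer sizes are arbitrary; this is not the 2014 challenge model.
class TinyConvNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # linear filter bank
            nn.ReLU(),                                    # non-linear gate
            nn.MaxPool2d(2),                              # keep strongest responses
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # second linear layer
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                  # layer-by-layer distillation
        return self.classifier(x.flatten(1))  # class scores from what remains

# One 224x224 RGB image in, one vector of class scores out.
scores = TinyConvNet()(torch.randn(1, 3, 224, 224))
```

Each convolution is the linear step of the analogy and each ReLU the non-linear one; what reaches the final classifier is only the information that has survived every intervening filter.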
While the traditional programming model is based on breaking down large problems into smaller solvable tasks, deep learning allows the computer to find its own solutions to presented problems through multiple interlaced layers. Although the programmer sets the initial parameters and provides the visual training set that a computer will use to detect certain images and their constituent parts, how this detection is actually achieved remains rather mysterious to the computer scientist. Consider the development of facial recognition programs, such as the one created by Facebook: a system can learn to detect a face and compare it to other faces through layers of convolutional neural networks if a large, coded dataset on which the computer can be trained is available. Untangling what the computer finds to be most predictive in facial recognition out of the many strengthened neural pathways formed in the image-recognition process, however, is not always possible. This unknown dimension, in which the computer essentially teaches itself, is what makes deep learning so difficult to interpret.

Since neural networks trained to discriminate between different types of images also have the information needed to generate them, the Google team aimed to use visualization as one way to determine whether the network had correctly learned the right features of an image by displaying their associations in the classification process. For example, a neural network designed to recognize dumbbells seemed to be doing so correctly, but was in fact making the classification inaccurately. Because the computer was trained with numerous images of weightlifters holding dumbbells (fig. 2), the neural net determined that the weightlifter's arm was an essential part of the classification of the object, since the majority of online images of dumbbells include an arm curling them. For this reason, the DeepDream algorithm generated visualizations of dumbbells with arms attached to them.

Similar misreadings occur when Grad-CAM is applied to works of art. Given an image of a painting of Christ Healing the Blind, the program labeled the image as a carousel and identified places on the picture field outside of the central scene of Christ's performance of the miracle as the locations that were significant for identification (fig. 4). By contrast, an image of a thirteenth-century icon depicting the Madonna and Child was labeled by Grad-CAM as a "book jacket"; the program highlighted the Virgin's face and the surrounding area but did not emphasize Christ. Late Antique parchment leaves from Egypt displaying Greek text were identified as "honeycomb," and the program selected seemingly random locations, including the background mat on which the artifact rests, as significant for its content. Not surprisingly, a photograph by Alfred Stieglitz of Georgia O'Keeffe clutching her coat collar performed somewhat better with the program (fig. 5): it recognized the hands on the garment as the significant part of the image when generating a caption.
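Unlike the classification errors it is meant to diagnose, Grad-CAM's own mechanism is well documented and can be summarized in a short sketch. The following is one condensed reading of the gradient-weighted class activation mapping idea, assuming a standard pretrained torchvision ResNet rather than the Georgia Tech lab's released code; it produces heatmaps of the kind overlaid on the images discussed above:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Grad-CAM in miniature: weight the last convolutional feature maps by the
# gradient of the top class score, sum them, and rectify, yielding a coarse
# heatmap of the image regions that drove the classification.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

store = {}
def hook(module, inputs, output):
    store["maps"] = output                                 # feature maps
    output.register_hook(lambda g: store.update(grads=g))  # their gradients

model.layer4.register_forward_hook(hook)  # last convolutional block

def grad_cam(image: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, H, W) normalized tensor -> (H, W) heatmap in [0, 1]."""
    scores = model(image)
    scores[0, scores.argmax()].backward()    # gradient of the top class score
    weights = store["grads"].mean(dim=(2, 3), keepdim=True)  # pool gradients
    cam = F.relu((weights * store["maps"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    return (cam / cam.max()).squeeze()
```

A map of this kind is what reveals, for instance, that the "book jacket" classification of the Madonna and Child icon rested on the Virgin's face rather than on the figure of Christ.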