Steganographic Generative Adversarial Networks
Authors: Denis Volkhonskiy, Ivan Nazarov, Evgeny Burnaev
Skolkovo Institute of Science and Technology, Nobel street, 3, Moscow, Moskovskaya oblast', Russia
e-mail: e.burnaev@skoltech.ru

ABSTRACT

Steganography is a collection of methods to hide secret information ("payload") within non-secret information ("container"). Its counterpart, steganalysis, is the practice of determining if a message contains a hidden payload, and recovering it if possible. The presence of a hidden payload is typically detected by a binary classifier. In the present study, we propose a new model for generating image-like containers based on Deep Convolutional Generative Adversarial Networks (DCGAN). This approach makes it possible to generate more steganalysis-secure message embedding using standard steganography algorithms. Experimental results demonstrate that the new model successfully deceives the steganography analyzer, and for this reason can be used in steganographic applications.

Keywords: generative adversarial networks, steganography, security

1. INTRODUCTION

Recent years have seen significant advances in estimation methods and applications of deep generative models. There are two major general frameworks for learning deep generative models: Variational Autoencoders (VAEs), [17], and Generative Adversarial Networks (GANs), [7]. The recent work of Hu et al. [11] develops a unifying framework, which establishes strong connections between these approaches and Adversarial Domain Adaptation (ADA), [5]. GANs have achieved impressive results in semi-supervised learning, [16], and image-to-image translation, [13]. In [7] the success of the GAN framework was illustrated on the problem of image generation.
A more recent paper [29] proposed a set of constraints on the architecture of convolutional GANs and showed that the resulting deep convolutional GANs (DCGANs) are capable of learning a hierarchy of representations from object parts to scenes, which are sufficiently robust to transfer across domains.

In this study we apply the DCGAN framework to the domain of steganography, i.e. practical approaches to concealing information (payload) within another piece of information (stego-container). In particular, we train a generator whose images are less susceptible to steganographic analysis than the original images used as stego-containers. At the same time, we require that the induced distribution of the synthetic images approximate well the distribution of the real images in the dataset.

Thus we train a generative model for image stego-containers by confronting it with two deep convolutional adversaries: a discriminator network, which regularizes the output to look like samples from the real dataset, and a steganographic analyzer, which aims at detecting whether an image conceals a hidden message. The presence of two regularizers in the generator's objective resembles the recently proposed multi-target GAN framework [2].

2. STEGANOGRAPHY

Steganography is a set of algorithms for concealing information in inconspicuous-looking communication, together with a collection of methods (steganalysis) for detecting and recovering the hidden message from suspicious media. In steganography the information to be hidden, the payload, is embedded by an algorithm inside a cover medium, the stego-container. The key drawback is that steganography offers security through obscurity: an embedded message is sent in the hope that a third party won't detect or discover it.
This makes pure steganography impractical without cryptography, which deals with secure communication over an insecure channel: the message is scrambled and authenticated with some keyed algorithm before being concealed in a cover medium, [1, 21]. In this respect steganography serves as a layer of weak security, since it adds a ciphertext detection and extraction step: encrypted data has much higher entropy than regular data. Besides information protection and covert communication, steganography is useful for watermarking in digital rights management and user identification.

The simplest and most popular algorithm of unkeyed stego-embedding is called Least Significant Bit (LSB) matching. The main idea is to take the binary representation of a secret message, pad it, and store it in a stego-container by overwriting the LSB of each byte within. The cover media used for LSB embedding must be resilient with respect to bit-level augmentation. In the case of images, the least significant bits of each colour channel of each pixel in the given image are used to hide the payload. The perturbations introduced by the LSB algorithm do not preserve marginal or joint colour statistics; despite being imperceptible to a human observer, this simplifies detection of the hidden payload with machine learning or statistical models. A modification of this method, which addresses this issue to some extent, is the so-called ±1-embedding: each bit of the message is hidden by randomly adding or subtracting 1 from a pixel's colour channel so that the last bit matches it.

More sophisticated steganographic schemes modify the digital media adaptively. The key idea is to constrain the embedding to regions of high local entropy, e.g. complex textures or noise.
Each pixel is assigned an embedding cost, and the embedding locations are picked in such a way as to minimize the distortion function

D(I, Î) = Σ_{i,j} ρ_{ij}(I, Î_{ij}),

where I is the cover image, Î is the stego-embedding, and ρ_{ij}(I, Î_{ij}) is the bounded cost of altering pixel (i, j) in the cover image I. The embedding itself is performed by coding methods such as Syndrome-Trellis Codes (STC) [3], which are essentially binary linear convolutional codes represented by a parity-check matrix. The state-of-the-art content-adaptive stego-embedding algorithms include HUGO [24], which computes the embedding costs based on Subtractive Pixel Adjacency Matrix (SPAM) features [23], as well as WOW [9] and S-UNIWARD [10], which use directional wavelet filters to weigh and pick regions with high entropy but implement different embedding cost functionals.

2.1 Steganalysis

The simplest approach to steganalysis is based on special feature extractors, e.g. SPAM [23] or SRM [4], combined with traditional machine learning models, such as support vector classifiers, decision trees, classifier ensembles, etc. With the recent overwhelming success of deep learning, specifically in the image classification and generation domain, newer approaches based on deep Convolutional Neural Networks (deep CNNs) are gaining popularity. For example, in [27] it is shown that a deep CNN with Gaussian activation functions achieves performance competitive with hand-crafted features, and in [25] it is demonstrated that even shallow CNNs are able to outperform the usual ML-based steganalysis techniques in terms of detection accuracy.

In this paper we consider steganographic embedding of random bit messages into specifically crafted images using the ±1-embedding algorithm. The security of the stego-containers is tested against a class of deep convolutional neural network steganalyzers, which try to distinguish images with hidden data from empty ones.
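For concreteness, the ±1-embedding used throughout the paper can be sketched in a few lines. This is a minimal illustration on a single 8-bit greyscale channel, not the authors' implementation; the function name `pm1_embed` and the boundary handling at 0/255 are our assumptions.

```python
import numpy as np

def pm1_embed(cover, bits, rng):
    """±1-embedding (LSB matching): if a pixel's LSB already equals the
    message bit, leave the pixel alone; otherwise randomly add or subtract 1
    so that the new LSB matches. At the range ends 0/255 the sign is forced
    so the value stays a valid 8-bit intensity."""
    stego = cover.astype(np.int16).ravel().copy()
    for i, b in enumerate(bits):
        if stego[i] % 2 != b:
            step = rng.choice([-1, 1])
            if stego[i] == 0:
                step = 1
            elif stego[i] == 255:
                step = -1
            stego[i] += step
    return stego.reshape(cover.shape).astype(np.uint8)
```

After embedding, every pixel's LSB equals the corresponding message bit, while no pixel changes by more than 1 — exactly the weak, high-frequency distortion that steganalyzers try to detect.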
2.2 Problem Statement

The overall scheme of steganography and steganalysis is presented in Fig. 1:

• typically, all transmitted images are attacked by a steganalyser (Eve);
• Alice (the steganography algorithm) tries to deceive Eve.

Figure 1: Complete scheme of steganography and steganalysis

The disadvantage of the standard steganography approach is that neither the containers nor the algorithms adapt to Eve: the containers are not adjusted to (and are not even aware of) the type of steganalyser used. The goal of this work is to create an adaptive container generator and a new steganography method.

2.3 Tasks for the research

We would like the containers to adapt to the given steganalyser in order to deceive it. We set the following tasks for the current work:

1. Adaptive container generation.
   • Create a model for generating image containers that can be used with any steganography algorithm;
   • the generated containers should deceive Eve (the steganalyser);
   • the containers should be adaptive to any type of Eve.

2. A new steganography method:
   • Create a model for adaptive generation of images with hidden information inside;
   • test the quality of encryption-decryption on the MNIST and CIFAR-10 datasets.

The difference between these two tasks is the following. In the first model, we build a generator of empty containers (images); these images can then be used with any steganography algorithm. In the second model, we generate not merely empty images, but images with the information encoded into them for later extraction. In other words, we aim to obtain an analogue of visual markers (such as QR codes).

3. GENERATIVE ADVERSARIAL NETWORKS

Generative Adversarial Network (GAN) training, [7], is a powerful framework for estimating generative models in an unsupervised learning setting by way of a two-player minimax game.
The generator player attempts to mimic the true data distribution p_data(x) by learning a transformation function z ↦ G_θ(z) of random input values z, drawn from a tractable distribution p_z. The generator receives feedback from the discriminator D_φ, which strives to distinguish synthetic, "fake" samples x = G(z), z ∼ p_z, from genuine, "real" samples x ∼ p_data. In the original formulation, [7], the learning process of the generator (G) and the discriminator (D) consists of searching for a saddle point of the following optimization problem:

min_θ max_φ L(θ, φ) = E_{x ∼ p_data}[ log D(x; φ) ] + E_{z ∼ p_z}[ log(1 − D(G(z; θ); φ)) ],   (1)

where D(x; φ) is the probability output by player D that x is a real sample rather than a synthetic one, and G(z; θ) is the generated sample.

In a typical application the ground truth data distribution is provided implicitly through its finite sample approximation on the dataset D = (x_i)_{i=1}^m. Furthermore, the expectations in (1) are approximated by sample averages over randomly drawn mini-batches: (G(z_i; θ))_{i=1}^B for (z_i)_{i=1}^B ∼ p_z i.i.d., and (x_i)_{i=1}^B sampled without replacement from the training set.

Despite many advantages, such as generation with a single forward pass and asymptotically consistent data distribution estimation, the major disadvantage of GANs is that training them requires finding a Nash (best response) equilibrium, [6]. Furthermore, since for deep networks G(·; θ) and D(·; φ) the objective is non-convex w.r.t. the parameters φ and θ, the order of min and max in (1) matters. Therefore [7] proposes to solve the problem by iteratively alternating between SGD maximization and minimization steps, but giving an optimization advantage to the discriminator player: the idea is to make several gradient ascent steps on L(θ, φ) w.r.t. φ before a single gradient descent step w.r.t. θ.
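The mini-batch approximation of (1) is easy to make concrete. Below is a toy numpy sketch, not the paper's model: a one-parameter logistic "discriminator" and an affine "generator" stand in for the deep networks, and the two expectations are replaced by sample averages.

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(x, phi):
    # toy logistic discriminator D(x; phi) = sigmoid(phi[0] * x + phi[1])
    return 1.0 / (1.0 + np.exp(-(phi[0] * x + phi[1])))

def generator(z, theta):
    # toy generator: an affine transformation of the input noise
    return theta[0] * z + theta[1]

def gan_value(theta, phi, x_real, z):
    # mini-batch estimate of the value function L(theta, phi) in (1)
    d_real = discriminator(x_real, phi)
    d_fake = discriminator(generator(z, theta), phi)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

x_real = rng.normal(2.0, 1.0, size=512)   # stand-in for p_data
z = rng.uniform(-1.0, 1.0, size=512)      # tractable noise p_z
print(gan_value((1.0, 2.0), (0.5, -1.0), x_real, z))
```

Note that when the discriminator is maximally confused, D ≡ 1/2, the value is exactly 2 log(1/2) = −log 4, the equilibrium value derived in [7].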
By giving an advantage to the discriminator, the proposed approach attempts to approximate what in fact is the generator's true objective: θ ↦ max_φ L(θ, φ). However, since in practice many GANs are trained with alternating single-step updates, [22] propose and justify a simpler joint single-step gradient method: the SGD update moves along the joint direction (∇_θ L, −∇_φ L), obtained via a single back-propagation step. Still, there is no consensus as to what the best training scheme for solving (1) is, [6].

GAN training is also complicated by the different regimes the networks undergo in the process. For instance, during the early stage of training the discriminator is prone to becoming excessively powerful, which makes the last term of (1) provide weak feedback to the generator. The authors of [7] suggest using

L_Gen(θ; φ) = −E_{z ∼ p_z}[ log D(G(z; θ); φ) ]

as the minimization objective of the generator, and setting the discriminator's loss, L_Dis(φ; θ), to −L(θ, φ), to be minimized. In spite of changing the loss and making the game no longer zero-sum, this heuristic leads to the same saddle point, as demonstrated in [22].

In fig. 2 we depict a sample of synthetic images from a freshly trained DCGAN on the Celebrities dataset [20]. The images indeed look realistic, albeit with occasional artifacts.

Figure 2: Sample synthetic images generated by DCGAN

4. STEGANOGRAPHIC GENERATIVE ADVERSARIAL NETWORKS

4.1 Model description

Let I = [−1, 1]^{H×W×3} be the space of images with dimensions H × W and RGB channel saturation values between −1 and 1. Let p_z be a uniform distribution on Z = [−1, 1]^{d_Z}, and p_data be the distribution of reference images on a subset of I. Finally, the set of messages M is given by {0, 1}^{d_M}, d_M ≤ HW, and by S_m : I ↦ I we denote the LSB embedding of the message m ∈ M in the cover image x ∈ I.
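The reason the −log D(G(z)) heuristic helps is visible directly in the gradients w.r.t. the discriminator's logit s on a fake sample, where D = σ(s): the saturating loss log(1 − D) has derivative −σ(s), which vanishes precisely when D confidently rejects the fake, while the non-saturating loss −log D has derivative σ(s) − 1, which stays large there. A small numerical check (the function names are ours):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def grad_saturating(s):
    # d/ds log(1 - sigmoid(s)) = -sigmoid(s): vanishes for s << 0
    return -sigmoid(s)

def grad_nonsaturating(s):
    # d/ds -log(sigmoid(s)) = sigmoid(s) - 1: stays ~ -1 for s << 0
    return sigmoid(s) - 1.0

s = -6.0  # a strong discriminator: D(G(z)) = sigmoid(-6) ~ 0.0025
print(abs(grad_saturating(s)), abs(grad_nonsaturating(s)))
```

At s = −6 the saturating gradient is three orders of magnitude smaller than the non-saturating one, which is why early in training the heuristic objective gives the generator usable feedback.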
We introduce the Steganographic Generative Adversarial Network (SGAN) model, which is a zero-sum game between two players: a generator network G : Z ↦ I, which tries to mimic p_data, and an adversary consisting of two parts:

• a discriminator network D : I ↦ [0, 1], which distinguishes synthetic images x = G(z), z ∼ p_z, from real x ∼ p_data;
• a steganalyzer network A : I ↦ [0, 1], which tries to separate cover images S_m(x) with payload m ∈ M from empty images x, x ∼ Q, for some image distribution Q on I.

The value function of the game (see fig. 3) is

L(θ, φ, ψ) = α L_Dis(φ; θ) + (1 − α) L_San(ψ; θ) → min_θ max_{φ, ψ},   (2)
L_Dis(φ; θ) = E_{x ∼ p_data}[ log D(x; φ) ] + E_{z ∼ p_z}[ log(1 − D(G(z; θ); φ)) ],   (3)
L_San(ψ; θ) = E_{z ∼ p_z}[ E_m[ log A(S_m(G(z; θ)); ψ) ] ] + E_{z ∼ p_z}[ log(1 − A(G(z; θ); ψ)) ].   (4)

Here the discriminator and the steganalyzer maximize the likelihoods L_Dis and L_San, respectively, while the generator minimizes the convex combination of the likelihoods of D and A in (2). The mixing parameter α ∈ (0, 1) controls the trade-off between the realism of the generated images and their quality as containers against steganalysis. Analysis of preliminary experimental results showed that for α ≤ 0.5 the generator fails to approximate the distribution of the reference images.

This model resembles the recently proposed multi-target GAN framework, [2], which pits the generator against multiple discriminators: a boosted ensemble, a mean-aggregated combination of discriminators, or discriminators with adaptively adjustable power. The main idea of that paper naturally suggests that additional discriminators can be used for regularization, for adaptation of the generator's output to other domains, or for endowing the synthetic samples with certain required properties.
We propose to train the model with the joint simultaneous gradient update scheme of [22]: jointly backpropagate through the networks and update

• D with φ ← φ + γ_D ∇_φ L_Dis;
• A with ψ ← ψ + γ_A ∇_ψ L_San;
• G with θ ← θ − γ_G ∇_θ L, where L is as in (2).

The expectations are substituted by the empirical averages over the joint mini-batch of images x ∼ p_data, noise z ∼ p_z, and messages m ∼ {0, 1}^{d_M}.

It is expected that the SGAN game would, for a suitable class of deep CNNs and an appropriate training schedule, yield an equilibrium in which the generator produces realistic images capable of concealing messages embedded by LSB against a deep convolutional steganalyzer. As in the case of the original formulation [7], if the players were not confined to the class of deep convolutional networks, the optimal steganalyzer would be given by the ratio estimator between the distribution q of the images induced by G(z; θ) for z ∼ p_z, and the distribution q_s, implicitly defined as the distribution of S_m(x) over x ∼ q and m ∼ {0, 1}^{d_M}. We expect that the optimal generator would induce a distribution whose variates have a uniformly random least significant bit in the colour data of each pixel, since for a random m the stego-embedding S_m(x) alters x only on the scale of 2^{−7}, and the resulting bit is, essentially, random.

4.2 Challenges

One of the challenges in the proposed model is the fact that any stego-embedding algorithm introduces a distortion to the cover medium, which is generally a non-differentiable perturbation as a function of the data in the medium.
For instance, LSB matching modifies the colour channel data of the pixels independently, and thus can be represented as a residual-like transformation of each colour channel value: S_m(x) = x + δ_m(x), for any message bit m ∈ {0, 1} and x ∈ [−1, +1], where δ_m(x) ∈ {0, ±ε}, ε = 2^{−7}, is the distortion. The δ_m(x) is a random function given by

δ_m(x) = ξ p_m( (x + 1) ε^{−1} ),   (5)

where p_m(z) is 0 when ∃ k ∈ Z : z − m ∈ [2k, 2k + 1), i.e. when the LSB of the integer part of z matches m, and 1 otherwise. The value ξ is +ε if m = 1 and x ∈ [−1, −1 + ε), and −ε if m = 0 and x ∈ (+1 − ε, +1], but otherwise an independent random variable taking the values ±ε, which is the addition/subtraction mask in the LSB algorithm.

The key observation is that for fixed m and ξ this perturbation is fully determined by p_m(z), which is constant on the intervals (k, k + 1) with k ∈ {0, ..., 255}. While this function is non-differentiable at finitely many points, where the 0-1 switching occurs, everywhere else in (0, 256) it is constant and has zero derivative w.r.t. z. Thus for fixed m and ξ the distortion δ_m(x) has derivative zero for almost every x ∈ (−1, 1).

In light of this heuristic argument, the following procedure for stego-embedding can be used while training: during the forward pass the embedding x ↦ S_m(x) is exact, but during back-propagation the embedding response is approximated by the identity function, S_m(x) ≈ x. The main drawback of this "linear" approximation is that it provides essentially no gradient feedback to the generator and acts as if nothing were embedded. Thus, although almost correct, it is ill suited for the purpose of learning stego-secure cover entities by design. Therefore we propose another approach: for a fixed message m ∈ {0, 1} and noise ξ ∈ {±ε}, approximate the LSB embedding layer by a differentiable transformation.
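As a concrete reference point for the smooth surrogate, the exact distortion (5) on a single channel value can be implemented directly. This is a minimal numpy sketch under the paper's conventions (8-bit channel codes mapped to [−1, 1], ε = 2⁻⁷); the helper names `lsb_of` and `delta` are our own.

```python
import numpy as np

EPS = 2.0 ** -7  # one LSB step for 8-bit channels rescaled to [-1, 1]

def lsb_of(x):
    # least significant bit of the 8-bit integer code of x in [-1, 1],
    # i.e. the parity of the integer part of (x + 1) / EPS
    return int(np.floor((x + 1.0) / EPS)) & 1

def delta(x, m, rng):
    """Distortion delta_m(x) of LSB matching for one channel value.
    Returns 0 if the LSB already equals m, otherwise ±EPS: the sign is
    random except at the ends of the range, where it is forced so the
    result stays inside [-1, +1]."""
    if lsb_of(x) == m:
        return 0.0
    if x < -1.0 + EPS:          # lowest code: can only add
        return EPS
    if x > 1.0 - EPS:           # highest code: can only subtract
        return -EPS
    return rng.choice([-EPS, EPS])
```

By construction S_m(x) = x + delta(x, m, ·) always has LSB equal to m, and for fixed m and sign the map is piecewise constant in x, which is exactly the zero-gradient problem discussed above.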
To this end we substitute the mismatch indicator in (5) by a sine waveform with nonlinear gain at the extremes of its range:

p_m(z) ≈ s_β(z; m) = σ( β sin((m − z) π) ),   (6)

where β > 0 determines the fidelity of the approximation and σ(a) is the sigmoid function, a ↦ (1 + e^{−a})^{−1}. The derivative of the sigmoid function is a ↦ σ(a)(1 − σ(a)), so this approximation is not too computationally demanding and provides accurate gradient feedback at the jump points of p_m(z).

Figure 3: SGAN information flow diagram

4.3 Training process

Stochastic mini-batch gradient descent update rules for the components of SGAN are listed below:

• for D the rule is θ_D ← θ_D + γ_D ∇_D L with

∇_D L = ∂/∂θ_D { E_{x ∼ p_data(x)}[ log D(x, θ_D) ] + E_{z ∼ p_noise(z)}[ log(1 − D(G(z, θ_G), θ_D)) ] };

• S is updated similarly to D: θ_S ← θ_S + γ_S ∇_S L, where

∇_S L = ∂/∂θ_S E_{z ∼ p_noise(z)}[ log S(Stego(G(z, θ_G)), θ_S) + log(1 − S(G(z, θ_G), θ_S)) ];

• for the generator G: θ_G ← θ_G − γ_G ∇_G L with ∇_G L given by

∇_G L = ∂/∂θ_G α E_{z ∼ p_noise(z)}[ log(1 − D(G(z, θ_G), θ_D)) ]
 + ∂/∂θ_G (1 − α) E_{z ∼ p_noise(z)}[ log S(Stego(G(z, θ_G)), θ_S) ]
 + ∂/∂θ_G (1 − α) E_{z ∼ p_noise(z)}[ log(1 − S(G(z, θ_G), θ_S)) ].

The main distinction from the GAN model is that we update G in order to maximize not only the error of D, but the error of the linear combination of the classifiers D and S.

5. STEGANOGRAPHIC ENCRYPTION GENERATIVE ADVERSARIAL NETWORKS

5.1 Model Description

The Steganographic Encryption Generative Adversarial Network (SEGAN) model was constructed for information encryption/decryption purposes. It consists of

• Alice: a generator network that produces realistic images containing hidden information.
  – Input: secret key (binary), secret message (binary), class for generation (y), noise
  – Output: an image with the hidden secret message inside

• Bob: a decryption network that extracts the hidden message from the image.
  – Input: image, secret key, class of the image (y)
  – Output: the secret hidden message

• Discriminator: a network that tries to detect whether the image is real or generated.
  – Input: image, class of the image (y)
  – Output: 0/1 (generated/real class)

The full scheme is presented in Fig. 4. This model can be considered an autoencoder with a high-dimensional hidden representation; the discriminator then acts as a kind of regularization.

5.2 Training process

The training process can be represented as the usual GAN training process, with some modifications. First we define the loss functions used to construct the update rules:

• Alice's loss (as a realistic image generator): the standard GAN generator loss,

L_A = E_y E_{m ∼ p_message} E_{k ∼ p_key} E_{z ∼ p_noise}[ log(1 − C(θ_C, A(θ_A, m, k, z, y), y)) | y, m, k, z ].   (7)

• The Alice-Bob encryption loss: the standard l_2 loss between the original message and the reconstructed message,

L_AB = E_y E_{m ∼ p_message} E_{k ∼ p_key} E_{z ∼ p_noise}[ (m − B(θ_B, A(θ_A, m, k, z, y), k, y))^2 | y, m, k, z ] → min.   (8)

Figure 4: SEGAN information flow diagram

Input: d_k — the dimension of the binary key; d_m — the dimension of the binary message; d_n — the dimension of the input noise
for epoch in 1 ... n_epoch do
  for minibatch from the dataset do
    Sample a minibatch of d_n noise samples {z_1, ..., z_{d_n}}
    Sample a minibatch of d_m message samples {m_1, ..., m_{d_m}}
    Sample a minibatch of d_k key samples {k_1, ..., k_{d_k}}
    Update A according to loss L_A
    Update C according to loss L_C
    if epoch > 1 then
      Update A, B according to loss L_AB
    end
  end
end
Algorithm 1: SEGAN training algorithm

• Discriminator's loss.
This is the usual GAN discriminator loss, which is calculated as an average cross-entropy:

L_C = −E_y E_{x ∼ p_data}[ log C(θ_C, x, y) | y ] − E_y E_{m ∼ p_message} E_{k ∼ p_key} E_{z ∼ p_noise}[ log(1 − C(θ_C, A(θ_A, m, k, z, y), y)) | y, m, k, z ] → min.   (9)

The total SEGAN training procedure is presented in Algorithm 1.

6. EXPERIMENTS WITH STEGANOGRAPHIC GENERATIVE ADVERSARIAL NETWORKS

6.1 Steganographic Vectors

Before conducting extensive numerical experiments on the security of the LSB embedding in cover images produced by an SGAN-trained generator, we run a simpler experiment as a proof of concept. Since LSB matching embeds each bit of the message into a pixel independently of its context, we study numerically the stego-security properties of SGAN-generated 1-d vectors.

6.1.1 Validation Protocol

In the process of training SGAN for T iterations we obtain a sequence of generators (G_t)_{t=1}^T, where G_t(·) = G(·; θ_t) and θ_t are the parameters of the generator after t mini-batch SGD updates. We use the following empirical validation protocol for the generator after the t-th iteration:

1. Draw a sample S_t = (x_i, y_i)_{i=1}^M, i.e. for i = 1, ..., M:
   (a) independently draw y_i ∼ {0, 1} and z_i ∼ p_z;
   (b) get a message m_i ∈ M = {0, 1}^{d_M};
   (c) synthesize a cover entity x*_i = G_t(z_i) and set x_i = S_{m_i}(x*_i) if y_i = 1, and x_i = x*_i if y_i = 0.

2. Assess the performance of an independent steganalyzer A* with K-fold cross-validation on S_t.

This sampling procedure ensures that the examples in the stego-sample S_t are independent and identically distributed. We also control the diversity of the embedded messages m_i through different message generation scenarios:

• Fixed: m_i = m_0, with m_0 picked once from M;
• Pool(n): random m_i from M_0 ⊂ M, |M_0| = n;
• Arbitrary: random m_i from M.
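Step 1 of the protocol can be sketched as follows. This is a toy numpy illustration under our own naming (`embed`, `draw_stego_sample`, and a stand-in generator), not the authors' code; `embed` performs elementwise LSB matching on vectors in [−1, 1] with ε = 2⁻⁷.

```python
import numpy as np

EPS = 2.0 ** -7

def embed(x, m, rng):
    # LSB matching on a vector x in [-1, 1]^d: where the LSB of the 8-bit
    # code mismatches the message bit, add ±EPS (sign forced at the ends
    # of the range so the result stays inside [-1, 1])
    code = np.floor((x + 1.0) / EPS).astype(int)
    mism = (code & 1) != m
    step = rng.choice([-EPS, EPS], size=x.shape)
    step = np.where(x < -1.0 + EPS, EPS, step)
    step = np.where(x > 1.0 - EPS, -EPS, step)
    return np.where(mism, x + step, x)

def draw_stego_sample(gen, d_msg, n, rng, scenario="arbitrary", pool=None):
    """Step 1 of the validation protocol: a labelled stego-sample (x_i, y_i),
    where y_i = 1 marks vectors carrying an embedded message."""
    xs, ys = [], []
    for _ in range(n):
        y = int(rng.integers(0, 2))
        x = gen(rng)                            # synthetic cover G_t(z)
        if scenario == "pool":                  # Pool(n) scenario
            m = pool[rng.integers(0, len(pool))]
        else:                                   # Arbitrary scenario
            m = rng.integers(0, 2, size=d_msg)
        xs.append(embed(x, m, rng) if y == 1 else x)
        ys.append(y)
    return np.stack(xs), np.array(ys)
```

The "Fixed" scenario is the special case Pool(1). The labelled sample returned here is exactly what the independent steganalyzer A* is cross-validated on.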
Under the "Fixed" and "Pool(n)" variants, m_0 and M_0, respectively, are chosen independently for each sample S_t in the above outline. The motivation behind controlling the diversity stems from the idea that by comparing the optimal performance of A* under each scenario it is possible to verify whether S_m(G(z)) ≡ G(z) in distribution for independent z ∼ p_z and m ∼ M. In fact, with SGAN it is possible to train a generator that induces a distribution invariant under LSB embedding of messages from a specific, non-uniform distribution of messages M, just by sampling from it during training.

We use this validation protocol for the experiments in sec. 6.1. However, for the experiments with image generation (sec. 6.2) we extend the protocol, since in this case we have a reference distribution Q on I, which the generator G aims to replicate. The extension is illustrated in fig. 5. First, several independent real and synthetic training datasets are generated:

1. draw a synthetic stego-sample S_S as outlined above;
2. similarly generate a real stego-sample S_R by sampling x*_i from Q (without replacement if Q is an empirical distribution).

Second, on each of these datasets we train an independent steganalyzer A* and aggregate them with weighted majority voting to obtain the final pair of steganalyzers A*_S and A*_R, where the former is based on the synthetic training datasets and the latter on the real stego-samples. Finally, we independently generate several real and synthetic datasets as outlined above and use them for cross-validation of the steganalyzers: each one is validated on both kinds of test samples to assess if and how well the learned features transfer across the real/synthetic domains.

Figure 5: Extended validation protocol for the experiments in sec. 6.2.
6.1.2 Details of the Experiment

The generator G is a 1-d convolutional neural network (CNN) that expands the input z ∼ N_{d_I}(0, I_{d_I}) from R^{d_I} into [−1, +1]^{d_O}, where d_I = 4 and d_O = 16. The generator uses a series of 1-d transposed convolution layers with non-unit strides and the ReLU nonlinearity, a ↦ max{a, 0}, to upsample the input noise into an intermediate 32 × d_O state matrix, which is finally passed through a 1-d convolution layer with the tanh nonlinearity, a ↦ (e^a − e^{−a}) / (e^a + e^{−a}), to ensure output in [−1, +1]^{d_O}.

The steganalyzer A is also a 1-d CNN, which takes a sample x ∈ [−1, +1]^{d_O} and feeds the input through a set of 1-d convolution layers with the leaky ReLU, a ↦ max{αa, a} for α = 1/20, to get a 256 × 4 matrix of intermediate features. The matrix is passed through two fully connected layers with the leaky ReLU and 512 units each. The final layer of A outputs a single logit score, which reflects the degree to which the analyzer is "confident" that the sample x contains a message hidden with the LSB embedding. The network A, as well as G, has a Batch Normalization (BN) layer, [12], before each nonlinearity, except for the output layers.

The losses are set up as in (2) and (4), but without an explicit "discriminator" player. We simulate its feedback by imposing a total variation with l_2 loss on the generator's output:

L_Dis(θ) = E_{z ∼ p_z}[ Σ_{i=1}^{d_O} ( C_tv |x_i − x_{i−1}| + C_{l_2} (1/2) x_i^2 ) ],  x = G(z; θ),  x_0 = x_{d_O}.

The relative weight α in the generator's loss is set to α = C_san / (C_san + C_tv + C_{l_2}) for C_san = 2, C_tv = 10^{−2}, and C_{l_2} = 10^{−1}. We train the SGAN generator for T = 75 · 10^3 mini-batch iterations of 256 pairs of noise and message vectors. We set d_M = d_O and thus embed 1 bit of message in each element of the cover vector, achieving the flat-vector analogue of an image's 1 bit-per-pixel embedding density.
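The discriminator-free regularizer above is a one-liner. A small numpy sketch with the constants from the text (the function name is ours; note the circular boundary x_0 = x_{d_O} realised by `np.roll`):

```python
import numpy as np

C_TV, C_L2 = 1e-2, 1e-1  # the C_tv and C_l2 weights from the text

def smoothness_loss(x):
    """Total-variation + l2 penalty on a generated vector x, used in
    place of an explicit discriminator; the roll implements the circular
    boundary condition x_0 = x_{d_O}."""
    tv = np.sum(np.abs(x - np.roll(x, 1)))  # sum_i |x_i - x_{i-1}|
    l2 = 0.5 * np.sum(x ** 2)
    return C_TV * tv + C_L2 * l2
```

A constant vector incurs only the l_2 term, while a rapidly oscillating one is additionally charged for every jump, which is how this surrogate pushes the generator toward smooth, signal-like covers.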
The SGAN networks are trained with Adam, [14], with default parameter values and learning rates 2^{−14} and 2^{−18} for the steganalyzer (A) and the generator (G), respectively. We use different learning rates to make the generator adapt slowly to the steganalyzer's feedback, while allowing the analyzer to learn to detect hidden data in the generator's output more quickly, [8]. The training was carried out with the strategy proposed in [7]: 4 updates of A per 1 update of G. The scale parameter β of (6) was set to 8.

We run the validation protocol with sample size M = 2^{16} and the number of folds K set to 7. In this setting we validate two different independent steganalyzers A*: a Random Forest classifier ensemble with 101 trees, and a 1-d CNN similar to the network used during SGAN training. The only difference of the CNN A* is that it yields two values and passes them through the softmax nonlinearity, (a, b) ↦ ( 1/(1 + e^{b−a}), 1/(1 + e^{a−b}) ). The convolutional A* is trained using Adam SGD updates with learning rate 2^{−14} and weight decay of 2^{−3} (l_2 parameter regularization).

To simplify the task of the independent steganalyzer A* we use hand-crafted features instead of raw input vectors. These features are derived with the 8-bit reversal procedure and are strongly tailored to the LSB embedding. The main idea is that, through "reversal" of the bit order in each element, these features amplify the signal from the stego-embedding, which is naturally a weak high-frequency distortion.
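A sketch of this feature map (the exact per-value transform is spelled out below; the function names `reverse8` and `lsb_amplify` are ours):

```python
import numpy as np

def reverse8(u):
    # reverse the bit order of an unsigned 8-bit value
    r = 0
    for _ in range(8):
        r = (r << 1) | (u & 1)
        u >>= 1
    return r

def lsb_amplify(v):
    """8-bit reversal feature of a channel value v in [-1, +1]: the LSB
    of the integer code is moved to the 2^7 position, so a ±1-embedding
    change moves the feature by roughly the full unit scale."""
    eps = 255.0 / 2.0
    u = (v + 1.0) * eps          # map [-1, 1] onto [0, 255]
    ui = int(np.floor(u))        # the 8-bit integer code
    return (reverse8(ui) + (u - ui)) / eps - 1.0
```

Flipping one LSB shifts the raw value by about 1/127.5 ≈ 0.008, but shifts the reversed feature by 128/127.5 ≈ 1.0, an amplification of 128×, which is what makes the LSB signal easy for A* to pick up.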
Each value v ∈ [−1, +1] in the input vector is transformed as follows:

• set u = (v + 1) · ε for ε = 255/2 and let û = ⌊u⌋;
• convert û to an unsigned 8-bit value;
• produce (1/ε)( rev(û) + (u − û) ) − 1,

where rev(·) : {0, 1}^8 ↦ {0, 1}^8 reverses the bit order: rev(b_0, b_1, ..., b_7) = (b_7, b_6, ..., b_0), with the bit at the first position in the sequence being the least significant bit (LSB bit endianness). With these features, any change in the least significant bit affects the whole value on the 2^7 scale instead of 2^0, and makes more prominent the arithmetic effects in LSB matching due to the ±1 addition.

6.1.3 Results

The performance of the Random Forest A* on the output vectors of the generator after T iterations of training is shown in Table 1, and that of the 1-d CNN in Table 2. Both tables were obtained via the validation protocol and the feature preprocessing outlined above. It is worth noting that in both runs all metrics drop dramatically when the variety of the embedded messages is increased, which lends evidence to the invariance of the induced distribution under the LSB distortion. Numerical experiments were also conducted on raw features, but both steganalyzers performed poorly on them.

We also experimented with the strength of the feedback to G from the steganalyzer A in SGAN training by setting C_tv = 0 and changing C_san. The trained generator appears to be capable of successfully hiding message bits against the Random Forest A* even at a proportion C_{l_2} : C_san as low as 99 : 1. This can be attributed to the simplicity of the generator's goal in this experiment. For C_san = 10^{−3} and C_{l_2} = 1 − C_san, however, the generator failed to produce an LSB-invariant output distribution even in the "Arbitrary" message generation scenario.
This implies that the feedback provided by (4) is relevant to the task of generating cover vectors which make the LSB matching embedding less susceptible to steganalysis.

Figures 6 and 7 show the performance dynamics of the tested steganalyzers. The generator converges to the desired output distribution very quickly: in all SGAN re-runs with moderate to high values of C_san the steganalyzer A* fails to discriminate between empty and non-empty vectors after at most 20 iterations of training. Note that in the figures the proposed validation protocol was performed on each iteration during the first 100 SGAN iterations, and only on every 100-th iteration onward.

Table 1: Performance metrics of the Random Forest A* on the cover vectors generated after all iterations of training (µ ± 5σ, scaled by 100).

type          ROC-AUC        F1-score       Accuracy       OOB
Fixed         100.0 ± 0.0    100.0 ± 0.0    100.0 ± 0.0    100.0
Pool(64)       99.9 ± 0.1     99.9 ± 0.2     99.9 ± 0.2     99.9
Pool(256)      99.7 ± 0.2     98.9 ± 0.6     98.9 ± 0.6     98.8
Pool(1024)     84.7 ± 3.0     77.2 ± 3.2     76.7 ± 2.9     75.2
Pool(4096)     54.8 ± 2.7     53.1 ± 3.8     53.5 ± 3.2     52.4
Arbitrary      49.6 ± 3.0     50.4 ± 2.7     49.8 ± 2.2     49.9

Table 2: Performance metrics of the 1-d CNN A* on the cover vectors generated after all iterations of training (µ ± 5σ, scaled by 100).

type          ROC-AUC        F1-score       Accuracy
Fixed         100.0 ± 0.0     99.9 ± 0.1     99.9 ± 0.1
Pool(64)       98.9 ± 0.3     95.6 ± 1.3     95.5 ± 1.2
Pool(256)      92.1 ± 1.7     84.7 ± 2.6     84.5 ± 2.3
Pool(1024)     70.0 ± 4.5     64.9 ± 5.5     64.6 ± 3.7
Pool(4096)     56.6 ± 2.5     55.1 ± 4.8     54.8 ± 2.8
Arbitrary      50.0 ± 1.9     49.9 ± 5.9     50.2 ± 2.6

Figure 6: k-fold CV performance of A* (Random Forest) on synthetic images produced by the generator at different stages of training.
The main conclusion from this set of experiments is that the SGAN model (2) is useful for generating cover vectors whose unconditional distribution is invariant under the LSB embedding distortion (5). Also, the embedding approximation (6) is adequate and provides relevant gradient feedback. In the next section we experiment with generating realistic cover images, which make the simple LSB embedding more steganographically secure.

Figure 7: k-fold CV performance of A* (1-d CNN) on synthetic images generated at different stages of the SGAN training.

6.2 Steganographic Images

6.2.1 Data Description

In our experiments we use the Celebrities dataset [20], which contains 200,000 images. All images were cropped to 64 × 64 pixels. For steganalysis purposes we hold out 10% of the data as a test set. We denote the train set by A, the test set by B, and the steganography algorithm used for hiding information by Stego(x). After embedding some secret information we get the train set A + Stego(A) and the test set B + Stego(B). We end up with 380,000 images for steganalysis training and 20,000 for testing. For training the SGAN model we used all 200,000 cropped images. After 8 epochs of training our SGAN produces the images displayed in Fig. 8.

For information embedding we use the ±1-embedding algorithm with a payload of 0.4 bits per pixel in only one channel out of three. As the text for embedding we use randomly selected excerpts from articles in The New York Times.

6.2.2 Experimental Setup

In this section we describe the SGAN model structure. By C2D-BN-LR we denote the following structural block of a convolutional neural network: Conv2d → Batch Normalization → Leaky ReLU.
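The C2D-BN-LR block just defined can be sketched as a plain-numpy forward pass (for illustration only; the bias-free convolution, batch-statistics normalization with unit scale, and the 0.2 leaky slope are our DCGAN-style assumptions, not specified in the text):

```python
import numpy as np

def conv2d(x, w, stride=2):
    """Valid 2-d cross-correlation; x: (N, C, H, W), w: (F, C, k, k)."""
    n, c, h, width = x.shape
    f, _, k, _ = w.shape
    oh, ow = (h - k) // stride + 1, (width - k) // stride + 1
    out = np.zeros((n, f, oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[:, :, i*stride:i*stride+k, j*stride:j*stride+k]
            # contract over channels and the k x k window
            out[:, :, i, j] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3]))
    return out

def batch_norm(x, eps=1e-5):
    """Normalize each channel over batch and spatial axes (scale=1, shift=0)."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def leaky_relu(x, slope=0.2):
    return np.where(x >= 0, x, slope * x)

def c2d_bn_lr(x, w, stride=2, slope=0.2):
    """One C2D-BN-LR block: Conv2d -> Batch Normalization -> Leaky ReLU."""
    return leaky_relu(batch_norm(conv2d(x, w, stride)), slope)
```

The discriminator and steganalyser described next stack four such blocks; with stride 2, each block halves the spatial resolution of its input.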
The steganalyser network S and the image discriminator network D have a similar structure: four C2D-BN-LR layers, then a fully connected layer (1 neuron) → Sigmoid function to compute the output. The image generator network G is (in order): a fully connected layer (8192 neurons), four C2D-BN-LR blocks with fractionally-strided convolutions, and a hyperbolic tangent layer to compute the normalised output.

The SGAN model is trained to solve (2) using the Adam optimization algorithm [15] with learning rate 2 · 10^−4 and update parameters β1 = 0.5 and β2 = 0.999. For each mini-batch of images we update the weights of D and S once, then update the weights of G twice.

In the following experiments, in addition to the steganalyser S we use an independent steganalyser S*. We define a filter F(0) that is specific to steganalysis applications (see [26], [28], [30]) as follows:

F(0) = 1/12 ·
  [ −1   2   −2   2  −1
     2  −6    8  −6   2
    −2   8  −12   8  −2
     2  −6    8  −6   2
    −1   2   −2   2  −1 ]

Figure 8: Examples of images generated by SGAN after training for 8 epochs on the Celebrities dataset.

Table 3: Accuracy of the steganalyser S* trained on real images.

Type of test set \ Image generator    SGAN     DCGAN
Real images                           0.962
Generated images                      0.501    0.522

The structure of the independent steganalyser S* has the form: 2D convolution with the F(0) filter → Conv2D → Conv2D → Max Pooling → Conv2D → Conv2D → Max Pooling → fully connected layer (1024 neurons) → fully connected layer (1 neuron) → Sigmoid function for the output. This structure provides state-of-the-art steganalysis accuracy [26], and the filter F(0) speeds up the convergence of the steganalyser S* during training. For training this steganalyser we use the Adam optimization algorithm on the loss (2) with learning rate 5 · 10^−6, β1 = 0.9, β2 = 0.999. As the loss function we use binary cross-entropy.
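The filter F(0) can be written out and applied as follows (a minimal numpy sketch; in S* it is the kernel of the first convolution layer). Its coefficients sum to zero, so it suppresses smooth image content and amplifies the high-frequency residual in which embedding changes live:

```python
import numpy as np

# The high-pass preprocessing filter F(0) used by the steganalyser S*
# (cf. [26], [28], [30]).
F0 = np.array([[-1,  2,  -2,  2, -1],
               [ 2, -6,   8, -6,  2],
               [-2,  8, -12,  8, -2],
               [ 2, -6,   8, -6,  2],
               [-1,  2,  -2,  2, -1]], dtype=np.float64) / 12.0

def apply_f0(image):
    """Filter a 2-d grayscale image with F(0) (valid cross-correlation)."""
    h, w = image.shape
    out = np.zeros((h - 4, w - 4))
    for i in range(h - 4):
        for j in range(w - 4):
            out[i, j] = np.sum(image[i:i+5, j:j+5] * F0)
    return out
```

Because the kernel annihilates constant regions, the network after this layer sees mostly the noise-like residual, which is why starting from F(0) speeds up steganalyser training.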
The setup of the experiments can be described as follows:
• We train the SGAN and/or DCGAN model to generate images to be used as containers;
• We train the independent steganalyser S* using either real images (Sec. 6.2.3) or generated images (Sec. 6.2.4);
• We measure the accuracy of the steganalyser S*.

6.2.3 Training/Testing on Real Images

In this set of experiments we train the independent steganalyser S* on real images. The results are provided in Table 3. From the results we conclude that even the plain DCGAN generates synthetic container images that can easily deceive a steganalyser. Although containers generated by the SGAN model, which is specially designed to deceive the steganalyser, improve on the result of DCGAN, the improvement is not significant. These results also suggest that S* may have overfit to the data distribution of the real images and failed to generalize to the GAN approximation of that distribution in the synthetic image domain.

Another interesting result follows from an experiment with the HUGO steganography algorithm and SGAN-generated containers:
• We use the ±1-embedding algorithm for training the SGAN model;
• We test the HUGO steganography analyzer on real images and on images generated by the SGAN model.

Containers generated by the SGAN model decrease the HUGO steganalysis accuracy from 0.624 to 0.499, which demonstrates that Steganographic Generative Adversarial Networks can potentially be used as a universal tool for generating steganography containers tuned to deceive any specific steganalysis algorithm.

6.2.4 Training/Testing on Generated Images

In this set of experiments we train the steganalyser S* on images generated with the DCGAN model from the prior noise distribution p_noise(z) for some fixed seed value. In all experiments the size of the train set is fixed at ≈ 160,000.
After training, we test the analyser S* on images generated with the DCGAN model according to the following experimental conditions:

C1. We use the same seed value;
C3. We use the same seed value as in C1, and additionally tune the DCGAN model for several epochs.

Table 4: Training/testing on generated images according to experimental conditions C1–C3.

Experimental conditions    Accuracy
C1                         0.982
C3                         0.499

The experimental results in Table 4 indicate that additional tuning of the image generator G makes the steganalyser's task even harder, driving its accuracy from 0.982 (C1) down to 0.499 (C3), the level of a random classifier.

We also conducted an experiment with classification of generated images without steganographic embeddings. For this purpose we train a conditional DCGAN model on the MNIST dataset, and train a separate classifier for the MNIST classification task. The trained classifier achieved almost perfect accuracy both on the held-out real MNIST data and on synthetic images produced by the DCGAN. This provides evidence that it is possible to train an image classifier that shows acceptable accuracy both on real and synthetic images. However, it is the artificial generation of image containers that breaks the usual approaches to steganalysis.

7. INFORMATION ENCRYPTION WITH SEGAN

7.1 Data Description

For the experiments in the current section we use the MNIST [19] and CIFAR-10 [18] datasets. Both are considered benchmarks in Deep Learning and Computer Vision.
• The MNIST dataset is a set of grayscale handwritten digits from 0 to 9 of size 28 × 28;
• The CIFAR-10 dataset consists of 60,000 32 × 32 RGB images from 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.

7.2 Encryption Experiments

We consider the SEGAN model as a model for information encryption: it allows generating images with hidden information inside. In this experimental section we
• check whether the generated images are realistic;
• assess the quality of encryption.
Examples of generated images for the MNIST dataset are presented in Fig. 9. The generated digits appear realistic.

Figure 9: Sample synthetic images: MNIST

Examples of generated images for the CIFAR-10 dataset are presented in Fig. 10. These images are small and look only moderately realistic, which reflects the current limitations of generative modeling.

Figure 10: Sample synthetic images: CIFAR-10

The quality of encryption is presented in Table 5. As a measure of quality we consider

E_M [ #{ i : M_i ≠ B(A(M_i)) } ],

the average number of incorrectly reconstructed bits; Table 5 reports the complementary percentage of correctly reconstructed bits. As we can see, our new model allows encrypting and decrypting messages with a reconstruction quality of almost 1. Longer messages are slightly harder to encrypt than short ones.

Table 5: Quality of encryption/decryption (% of reconstructed bits).

Dataset \ Number of bits    16 bit    32 bit    64 bit
MNIST                       99.98     99.54     98.65
CIFAR-10                    99.96     99.82     99.81

8. CONCLUSIONS

In this work,
1. We open a new field for applications of Generative Adversarial Networks, namely container generation for steganography applications;
2. We consider the ±1-embedding algorithm and test novel approaches to more steganalysis-secure information embedding: we demonstrate that both the SGAN and DCGAN models are capable of decreasing the detection accuracy of a steganalysis method almost to that of a random classifier;
3. A model for secure adaptive generation of steganographic containers has been presented;
4. A number of ways to adapt container generation in order to deceive steganalysis have been proposed;
5. A new GAN-based steganography model has been proposed and tested on the MNIST and CIFAR-10 datasets;
6. As a result, an article [31] was written and submitted to the NIPS 2016 Workshop on Adversarial Training. The present paper cross-checks and significantly extends the results of that initial paper.

References

[1] Abbas Cheddad, Joan Condell, Kevin Curran, and Paul McKevitt.
Digital image steganography: Survey and analysis of current methods. Signal Processing, 90(3):727–752, 2010.
[2] Ishan Durugkar, Ian Gemp, and Sridhar Mahadevan. Generative multi-adversarial networks. arXiv preprint arXiv:1611.01673, 2016.
[3] Tomáš Filler, Jan Judas, and Jessica Fridrich. Minimizing additive distortion in steganography using syndrome-trellis codes. IEEE Transactions on Information Forensics and Security, 6(3):920–935, Sept 2011.
[4] Jessica Fridrich and Jan Kodovský. Rich models for steganalysis of digital images. IEEE Transactions on Information Forensics and Security, 7(3):868–882, 2012.
[5] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. ArXiv e-prints, May 2015.
[6] Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. ArXiv e-prints, December 2017.
[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. ArXiv e-prints, June 2017.
[9] Vojtěch Holub and Jessica Fridrich. Designing steganographic distortion using directional filters. In WIFS, 2012.
[10] Vojtěch Holub, Jessica Fridrich, and Tomáš Denemark. Universal distortion function for steganography in an arbitrary domain. EURASIP Journal on Information Security, 2014(1):1–13, 2014.
[11] Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P. Xing. On unifying deep generative models. ArXiv e-prints, June 2017.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint, 2015.
[13] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. ArXiv e-prints, November 2016.
[14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, 2014.
[15] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, 2014.
[16] Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. ArXiv e-prints, June 2014.
[17] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. ArXiv e-prints, December 2013.
[18] Alex Krizhevsky and G. Hinton. Convolutional deep belief networks on CIFAR-10. Unpublished manuscript, 40, 2010.
[19] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[20] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[21] Chandreyee Maiti, Debanjana Baksi, Ipsita Zamider, Pinky Gorai, and Dakshina Ranjan Kisku. Data Hiding in Images Using Some Efficient Steganography Techniques, pages 195–203. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
[22] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. ArXiv e-prints, June 2016.
[23] Tomáš Pevný, Patrick Bas, and Jessica Fridrich. Steganalysis by subtractive pixel adjacency matrix. IEEE Transactions on Information Forensics and Security, 5(2):215–224, 2010.
[24] Tomáš Pevný, Tomáš Filler, and Patrick Bas. Using high-dimensional image models to perform highly undetectable steganography. In Information Hiding, Calgary, Canada, June 2010.
[25] Lionel Pibre, Jérôme Pasquet, Dino Ienco, and Marc Chaumont.
Deep learning for steganalysis is better than a rich model with an ensemble classifier, and is natively robust to the cover source-mismatch. ArXiv e-prints, November 2015.
[26] Lionel Pibre, Jérôme Pasquet, Dino Ienco, and Marc Chaumont. Deep learning for steganalysis is better than a rich model with an ensemble classifier, and is natively robust to the cover source-mismatch. arXiv preprint arXiv:1511.04855, 2015.
[27] Yinlong Qian, Jing Dong, Wei Wang, and Tieniu Tan. Deep learning for steganalysis via convolutional neural networks. In SPIE/IS&T Electronic Imaging, pages 94090J–94090J. International Society for Optics and Photonics, 2015.
[28] Yinlong Qian, Jing Dong, Wei Wang, and Tieniu Tan. Deep learning for steganalysis via convolutional neural networks. In IS&T/SPIE Electronic Imaging, pages 94090J–94090J. International Society for Optics and Photonics, 2015.
[29] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint, 2015.
[30] Shunquan Tan and Bin Li. Stacked convolutional auto-encoders for steganalysis of digital images. In Asia-Pacific Signal and Information Processing Association, 2014 Annual Summit and Conference (APSIPA), pages 1–4. IEEE, 2014.
[31] Denis Volkhonskiy, Ivan Nazarov, Boris Borisenko, and Evgeny Burnaev. Steganographic generative adversarial networks. Workshop on Adversarial Training, Neural Information Processing Systems, 2016.