A One-Article Summary of AI Data Augmentation Methods
- Mixup

The core idea of Mixup is to randomly blend two training samples and their labels in a given ratio. This not only increases sample diversity but also smooths the decision boundary, improves recognition of hard examples, and makes the model more robust. The method has two steps:
1. Randomly select two samples (xi, yi) and (xj, yj) from the original training data, where the labels y are one-hot encoded.
2. Combine the two samples proportionally to form a new sample with a weighted label:
x̃ = λxi + (1 − λ)xj
ỹ = λyi + (1 − λ)yj

The final loss is the weighted sum of the cross-entropy losses computed against each of the two labels, where λ ∈ [0, 1] is the mixup hyper-parameter controlling the interpolation strength between the two samples.

# Mixup
import numpy as np

def mixup_batch(x, y, step, batch_size, alpha=0.2):
    """
    Get a mixup batch.
    :param x: training data
    :param y: one-hot labels
    :param step: current step
    :param batch_size: batch size
    :param alpha: hyper-parameter α of the Beta distribution, default 0.2
    :return: mixed x, y
    """
    candidates_data, candidates_label = x, y
    offset = (step * batch_size) % (candidates_data.shape[0] - batch_size)

    # slice out the current batch
    train_features_batch = candidates_data[offset:(offset + batch_size)]
    train_labels_batch = candidates_label[offset:(offset + batch_size)]
    if alpha == 0:
        return train_features_batch, train_labels_batch
    if alpha > 0:
        # per-sample mixing coefficients λ ~ Beta(α, α)
        weight = np.random.beta(alpha, alpha, batch_size)
        x_weight = weight.reshape(batch_size, 1)
        y_weight = weight.reshape(batch_size, 1)
        # pair each sample with a randomly chosen partner from the same batch
        index = np.random.permutation(batch_size)
        x1, x2 = train_features_batch, train_features_batch[index]
        x = x1 * x_weight + x2 * (1 - x_weight)
        y1, y2 = train_labels_batch, train_labels_batch[index]
        y = y1 * y_weight + y2 * (1 - y_weight)
        return x, y
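The claim above that the final loss decomposes into a λ-weighted sum of per-label cross-entropies follows from cross-entropy being linear in the target. A quick NumPy check (all names here are illustrative, not taken from the code above):

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 5
lam = 0.3                                # mixup coefficient λ
y_i = np.eye(num_classes)[1]             # one-hot label of sample i
y_j = np.eye(num_classes)[4]             # one-hot label of sample j
y_mix = lam * y_i + (1 - lam) * y_j      # mixed label ỹ

logits = rng.normal(size=num_classes)
probs = np.exp(logits) / np.exp(logits).sum()  # softmax prediction

def cross_entropy(y, p):
    return -(y * np.log(p)).sum()

# CE(ỹ, p) equals λ·CE(yi, p) + (1 − λ)·CE(yj, p)
lhs = cross_entropy(y_mix, probs)
rhs = lam * cross_entropy(y_i, probs) + (1 - lam) * cross_entropy(y_j, probs)
```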
3 Deep-Learning-Based Data Augmentation

3.1 Data Augmentation in Feature Space
    
Unlike traditional augmentation methods that operate in the input space, a neural network maps input samples to low-dimensional vectors in its intermediate layers (representation learning), so combinations and transformations can be applied directly in the learned feature space. MoEx is one such method.
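As a concrete (simplified) illustration of feature-space augmentation, here is a minimal NumPy sketch of a MoEx-style moment exchange: one sample's features are normalized and then re-scaled with another sample's mean and standard deviation. The shapes and pairing scheme are assumptions for illustration, not the MoEx reference implementation:

```python
import numpy as np

def moment_exchange(feat_a, feat_b, eps=1e-5):
    # simplified MoEx-style sketch: keep A's normalized feature content,
    # inject B's first and second moments (mean and std)
    mu_a, std_a = feat_a.mean(), feat_a.std() + eps
    mu_b, std_b = feat_b.mean(), feat_b.std() + eps
    return (feat_a - mu_a) / std_a * std_b + mu_b

rng = np.random.default_rng(0)
f_a = rng.normal(loc=0.0, scale=1.0, size=64)  # features of sample A (hypothetical)
f_b = rng.normal(loc=3.0, scale=2.0, size=64)  # features of sample B (hypothetical)
f_new = moment_exchange(f_a, f_b)
# the augmented features now carry B's statistics on A's content
```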

3.2 Data Augmentation Based on Generative Models

Generative models such as the Variational Auto-Encoder (VAE) and the Generative Adversarial Network (GAN) can also be used for data augmentation by synthesizing new samples. Compared with traditional augmentation techniques, this network-based synthesis is more involved, but the generated samples are more diverse.
- Variational Autoencoder (VAE)
The basic idea of the Variational Autoencoder (VAE) is to transform real samples through an encoder network into an idealized latent distribution, then pass that distribution through a decoder network to construct generated samples; training drives the generated samples to be close to the real ones.

# VAE model
class VAE(keras.Model):
    ...
    def train_step(self, data):
        with tf.GradientTape() as tape:
            z_mean, z_log_var, z = self.encoder(data)
            reconstruction = self.decoder(z)
            reconstruction_loss = tf.reduce_mean(
                tf.reduce_sum(
                    keras.losses.binary_crossentropy(data, reconstruction), axis=(1, 2)
                )
            )
            # KL divergence between the approximate posterior and N(0, I)
            kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
            kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))
            total_loss = reconstruction_loss + kl_loss
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        return {
            "loss": self.total_loss_tracker.result(),
            "reconstruction_loss": self.reconstruction_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }
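The snippet above omits the encoder's sampling step. Conventionally a VAE encoder returns z via the reparameterization trick, z = mean + exp(0.5 · log_var) · ε with ε ~ N(0, I), which is what makes sampling differentiable. A NumPy sketch of that step, with illustrative shapes that are assumptions rather than values from the code above:

```python
import numpy as np

rng = np.random.default_rng(0)
z_mean = np.zeros((4, 2))                  # batch of 4, latent dimension 2
z_log_var = np.log(np.full((4, 2), 0.25))  # variance 0.25, i.e. std 0.5
eps = rng.standard_normal((4, 2))          # ε ~ N(0, I)
# reparameterization trick: z = mean + std * ε
z = z_mean + np.exp(0.5 * z_log_var) * eps
```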
- Generative Adversarial Network (GAN)
A Generative Adversarial Network (GAN) consists of a generator network (Generator, G) and a discriminator network (Discriminator, D). The generator implements a mapping G: Z → X (input noise z, output generated image data x), while the discriminator judges whether its input comes from the real data or from the generator.

# DCGAN model
class GAN(keras.Model):
    ...
    def train_step(self, real_images):
        batch_size = tf.shape(real_images)[0]
        random_latent_vectors = tf.random.normal(shape=(batch_size, self.latent_dim))
        # G: Z → X (input noise z, output generated image data x)
        generated_images = self.generator(random_latent_vectors)
        # combine generated and real samples and assign discriminator labels
        combined_images = tf.concat([generated_images, real_images], axis=0)
        labels = tf.concat(
            [tf.ones((batch_size, 1)), tf.zeros((batch_size, 1))], axis=0
        )
        # add random noise to the labels
        labels += 0.05 * tf.random.uniform(tf.shape(labels))
        # train the discriminator
        with tf.GradientTape() as tape:
            predictions = self.discriminator(combined_images)
            d_loss = self.loss_fn(labels, predictions)
        grads = tape.gradient(d_loss, self.discriminator.trainable_weights)
        self.d_optimizer.apply_gradients(
            zip(grads, self.discriminator.trainable_weights)
        )

        random_latent_vectors = tf.random.normal(shape=(batch_size, self.latent_dim))
        # misleading labels for the generated samples (all marked as "real")
        misleading_labels = tf.zeros((batch_size, 1))
        # train the generator
        with tf.GradientTape() as tape:
            predictions = self.discriminator(self.generator(random_latent_vectors))
            g_loss = self.loss_fn(misleading_labels, predictions)
        grads = tape.gradient(g_loss, self.generator.trainable_weights)
        self.g_optimizer.apply_gradients(zip(grads, self.generator.trainable_weights))
        # update loss metrics
        self.d_loss_metric.update_state(d_loss)
        self.g_loss_metric.update_state(g_loss)
        return {
            "d_loss": self.d_loss_metric.result(),
            "g_loss": self.g_loss_metric.result(),
        }
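Note the label convention in the snippet above: generated images get label 1 and real images label 0, so the generator's "misleading" target is 0. A small NumPy sketch of the binary cross-entropy shows why this target pushes the generator to fool the discriminator (the numeric discriminator outputs are illustrative assumptions):

```python
import numpy as np

def bce(label, pred, eps=1e-7):
    # binary cross-entropy for a single prediction, clipped for stability
    pred = np.clip(pred, eps, 1 - eps)
    return -(label * np.log(pred) + (1 - label) * np.log(1 - pred))

# generator loss with misleading target 0 is large while D flags the
# sample as fake (output near 1)...
g_loss_detected = bce(0.0, 0.9)
# ...and small once the fake fools D into answering "real" (output near 0)
g_loss_fooled = bce(0.0, 0.1)
```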
3.3 Data Augmentation Based on Neural Style Transfer
    
Neural Style Transfer can transfer the style of one image onto another while preserving the original content. Beyond effects similar to color-space or lighting transformations, it can also generate different textures and artistic styles.
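Style in neural style transfer is commonly captured by the Gram matrix of a CNN feature map: channel-wise correlations that discard spatial layout, so matching Gram matrices transfers texture without copying content. A NumPy sketch (the feature-map shape and the HWC layout are assumptions for illustration):

```python
import numpy as np

def gram_matrix(features):
    # features: (height, width, channels) activation map from some CNN layer
    h, w, c = features.shape
    flat = features.reshape(h * w, c)
    # (channels, channels) matrix of channel correlations, normalized by area
    return flat.T @ flat / (h * w)

rng = np.random.default_rng(0)
fmap = rng.normal(size=(8, 8, 16))  # hypothetical feature map
g = gram_matrix(fmap)
# the style loss then compares Gram matrices of the stylized and style images
```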




