fixed random mask problem; added EDDI strategy
This commit is contained in:
LaurantChao 2019-04-03 16:59:27 +01:00 committed by GitHub
Parent 99aa1227c2
Commit c1f4cb19b8
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
8 changed files: 1659 additions and 0 deletions

81
README.txt Normal file

@ -0,0 +1,81 @@
EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE (https://arxiv.org/pdf/1809.11142.pdf)
For ease of evaluation, we provide two versions of the code.
The first version only contains the missing data imputation model using Partial VAE.
The second version contains both the training of the Partial VAE and the single optimal ordering strategy for feature selection (SING in the paper).
*****Version One: Missing Data Imputation*****
This code implements the Partial VAE (PNP) part, demonstrated on a UCI dataset.
To run this code:
python main_train_and_impute.py --epochs 3000 --latent_dim 10 --p 0.99 --data_dir your_directory/data/boston/ --output_dir your_directory/model/
Input:
In this example, we use the Boston house dataset. (https://archive.ics.uci.edu/ml/machine-learning-databases/housing/)
We split the dataset into train and test set in our code.
Random test set entries are removed for imputation quality evaluation (see the sketch at the end of this section).
Output:
This code will save the trained model to your_directory/model. To impute new data (the test dataset in this case), we load the pre-trained model.
If your training set itself contains missing data, you can use the same training set as the test set to obtain imputation results.
The code also computes the test RMSE for the imputed missing data and prints the result to the console.
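For reference, here is a minimal sketch of how random test entries are hidden before imputation is scored (mirroring main_train_and_impute.py; the 30% missing rate and the placeholder test matrix are illustrative only):
    import numpy as np
    from scipy.stats import bernoulli

    Data_test = np.random.rand(51, 14)                        # placeholder for the held-out test matrix
    mask_obs = bernoulli.rvs(1 - 0.3, size=Data_test.shape)   # 1 = kept observed, 0 = artificially removed
    Data_observed = Data_test * mask_obs                      # what the model sees at imputation time
    mask_target = 1 - mask_obs                                # entries to be imputed and scored (RMSE)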
*****Version Two: Single Optimal Feature Ordering*****
This code implements partial VAE (PNP) + SING together demonstrated on a UCI dataset.
To run this code:
python main_active_learning.py --epochs 3000 --latent_dim 10 --p 0.99 --data_dir your_directory/data/boston/ --output_dir your_directory/model/
Input:
In this example, we use the Boston house dataset. (https://archive.ics.uci.edu/ml/machine-learning-databases/housing/)
We split the dataset into train and test set in our code.
Random test set entries are removed for imputation quality evaluation.
Output:
This code will run the Partial VAE on the training set and use the trained model for sequential feature selection as presented in the paper.
It will output the result comparing RAND and SING in the form of an information curve, as presented in the paper.
It will also output the information gain for each feature selection step. These are stored in your_directory/model.
*****Possible Arguments*****
- epochs: number of epochs.
- latent_dim: size of latent space of partial VAE.
- p: upper bound for the artificial missingness probability. For example, if set to 0.9, then during each training epoch, the algorithm will
randomly choose a probability smaller than 0.9 and randomly drop observations according to this probability (a minimal sketch follows after this list).
Our suggestion is that if the original dataset already contains missing data, you can just set p to 0.
- batch_size: minibatch size for training. default: 100
- iteration: iterations (number of mini-batches) used per epoch. Set to -1 to run the full epoch.
If your dataset is large, set this to a smaller value such as 10.
- K: dimension of the feature map (h) of the PNP encoder. Default: 20
- M: number of MC samples used when performing imputation. Default: 50
- data_dir: Directory where UCI dataset is stored.
- output_dir: Directory where the trained model will be stored and loaded.
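For illustration, the following is a minimal sketch of how p is turned into an artificial missingness mask during training (it mirrors the mask construction in train_and_test_functions.py; the batch_size and obs_dim values are placeholders):
    import numpy as np
    from scipy.stats import bernoulli

    p = 0.7                          # upper bound for the artificial missingness probability
    batch_size, obs_dim = 100, 14    # placeholder shapes
    # every entry of the mini-batch gets its own dropout probability, capped at p
    dropout = np.minimum(np.random.rand(batch_size, obs_dim), p)
    mask_drop = bernoulli.rvs(1 - dropout)   # 1 = kept, 0 = artificially dropped
    # the mask actually used for training also respects missingness already present in the data:
    # mask = mask_drop * mask_train_batch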
Other comments:
- We assume that the data is stored in an Excel file named d0.xls,
and that the last column is the target variable of interest (only used in active learning).
You should modify the load-data section according to your data (a loading sketch is given at the end of these comments).
- Note that this code assumes real-valued data with Gaussian noise. You may need to modify the likelihood function for other types of data.
- In preprocessing, we chose to squash the data to the range of 0 and 1. Therefore our decoder output has also been squashed
by a sigmoid function. If you wish to change the preprocessing, you may also need to change the decoder accordingly.
This can be found in codings.py.
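As a reference, here is a minimal sketch of the expected data format and preprocessing (following the load-data section of main_train_and_impute.py; the path is a placeholder, and .values replaces the deprecated .as_matrix()):
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    Data = pd.read_excel('your_directory/data/boston/d0.xls').values   # last column = target variable
    # squash every column to the range [0, 1]
    Data = (Data - Data.min(axis=0)) / (Data.max(axis=0) - Data.min(axis=0))
    Mask = np.ones(Data.shape)   # 1 = observed, 0 = missing; this UCI dataset is fully observed
    Data_train, Data_test, mask_train, mask_test = train_test_split(
        Data, Mask, test_size=0.1, random_state=42)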
File Structure:
- main functions:
main_train_and_impute.py: implements the training of the Partial VAE (PNP) part, demonstrated on a UCI dataset.
main_active_learning.py: implements the EDDI strategy, together with a global single ordering strategy, based on the Partial VAE, demonstrated on a UCI dataset;
it will also generate an information curve plot.
- decoder-encoder functions: codings.py
- partial VAE class: p_vae.py
- training-impute functions: train_and_test_functions.py
- training-active learning functions: active_learning_functions.py
- active learning visualization: boston_bar_plot.py; this will visualize the decision process of EDDI on the Boston Housing data.
- data: data/boston/d0.xls
Dependencies:
- tensorflow 1.4.0
- scipy 1.10
- sklearn 0.19.1
- argparse 1.1


@ -0,0 +1,405 @@
from p_vae import *
from codings import *
import numpy as np
import tensorflow as tf
from scipy.stats import bernoulli
import argparse
import os
import random
from random import sample
#### parser configurations
parser = argparse.ArgumentParser(
description='EDDI')
parser.add_argument(
'--epochs',
type=int,
default=3000,
metavar='N_eps',
help='number of epochs to train (default: 3000)')
parser.add_argument(
'--latent_dim',
type=int,
default=10,
metavar='LD',
help='latent dimension (default: 10)')
parser.add_argument(
'--p',
type=float,
default=0.7,
metavar='probability',
help='dropout probability of artificial missingness during training')
parser.add_argument(
'--iteration',
type=int,
default=-1,
metavar='it',
help='iterations per epoch. set to -1 to run the full epoch. ')
parser.add_argument(
'--batch_size',
type=int,
default=100,
metavar='batch',
help='Mini Batch size per epoch. ')
parser.add_argument(
'--K',
type=int,
default=20,
metavar='K',
help='Dimension of PNP feature map ')
parser.add_argument(
'--M',
type=int,
default=50,
metavar='M',
help='Number of MC samples when perform imputing')
parser.add_argument(
'--repeat',
type=int,
default=10,
metavar='repeat',
help='Number of repeats of the active learning experiment')
parser.add_argument(
'--output_dir',
type=str,
default=os.getenv('PT_OUTPUT_DIR', '/tmp'))
parser.add_argument(
'--data_dir',
type=str,
default=os.getenv('PT_DATA_DIR', 'data'),
help='Directory where UCI dataset is stored.')
args = parser.parse_args()
#### Set directories
UCI = args.data_dir
ENCODER_WEIGHTS = os.path.join(args.output_dir, 'encoder.tensorflow')
FINETUNED_DECODER_WEIGHTS = os.path.join(args.output_dir, 'generator.tensorflow')
rs = 42 # random seed
def p_vae_active_learning(Data_train,mask_train,Data_test,mask_test,epochs,latent_dim,batch_size,p,K,M,Repeat,estimation_method=0):
'''
This function trains a Partial VAE and then performs active variable selection with the random (RAND), single global ordering (SING), and EDDI chain-rule (CHAI) strategies.
Note that we assume that the last column of x is the target variable of interest
:param Data_train: training data matrix
:param mask_train: mask matrix that indicates the missingness of training data. 1=observed, 0 = missing
:param Data_test: test data matrix
:param mask_test: mask matrix that indicates the missingness of test data. 1=observed, 0 = missing
:param latent_dim: latent dimension of partial VAE.
:param K: dimension of feature map of PNP encoder
:param M: number of samples used for MC sampling
:param Repeat: number of repeats.
:param estimation_method: what method to use for single ordering information reward estimation.
In order to calculate the single best ordering, we need to marginalize (average) the
information reward over the dataset (in this case, the test set).
We provide two methods of marginalization.
- estimation_method = 0: information reward marginalized using the model distribution p_{vae_model}(x_o).
- estimation_method = 1: information reward marginalized using the data distribution p_{data}(x_o)
:return: None (active learning results are saved to args.output_dir)
'''
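# The three strategies evaluated below are:
#   strategy 0 = RAND (random feature ordering)
#   strategy 1 = SING (a single global ordering shared by all test points)
#   strategy 2 = CHAI (the EDDI per-test-point chain rule approximation)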
for r in range(Repeat):
## train partial VAE
tf.reset_default_graph()
vae = train_p_vae(Data_train,mask_train, epochs, latent_dim,batch_size, p, K,10)
n_test = Data_test.shape[0]
n_train = Data_train.shape[0]
OBS_DIM = Data_test.shape[1]
# kwargs = {
# 'K': K,
# 'obs_distrib': "Gaussian",
# 'latent_dim': latent_dim,
# 'encoder': PNP_fc_uci_encoder,
# 'decoder': fc_uci_decoder,
# 'obs_dim': OBS_DIM,
# 'load_model': 1,
# 'decoder_path': FINETUNED_DECODER_WEIGHTS,
# 'encoder_path': ENCODER_WEIGHTS,
# }
# vae = PN_Plus_VAE(**kwargs)
## create arrays to store results
if r == 0:
# information curves
information_curve_RAND = np.zeros(
(Repeat, n_test, OBS_DIM - 1 + 1))
information_curve_SING = np.zeros(
(Repeat, n_test, OBS_DIM - 1 + 1))
information_curve_CHAI = np.zeros(
(Repeat, n_test, OBS_DIM - 1 + 1))
# history of optimal actions
action_SING = np.zeros((Repeat, n_test,
OBS_DIM - 1 ))
action_CHAI = np.zeros((Repeat, n_test,
OBS_DIM - 1))
# history of information reward values
R_hist_SING = np.zeros(
(Repeat, OBS_DIM - 1 , n_test,
OBS_DIM - 1 ))
R_hist_CHAI = np.zeros(
(Repeat, OBS_DIM - 1, n_test,
OBS_DIM - 1))
# history of posterior samples of partial inference
im_SING = np.zeros((Repeat, OBS_DIM - 1 , M,
n_test, OBS_DIM ))
im_CHAI = np.zeros((Repeat, OBS_DIM - 1, M,
n_test, OBS_DIM))
## Perform active variable selection with RAND (random ordering), SING (single global ordering), and EDDI (chain rule approximation)
for strategy in range(3):
if strategy == 0:### random strategy
## create arrays to store data and missingness
x = Data_test[:, :] #
x = np.reshape(x, [n_test, OBS_DIM])
mask = np.zeros((n_test, OBS_DIM))
mask[:, -1] = 0 # we will never observe target value
## initialize array that stores optimal actions (i_optimal)
i_optimal = [
nums for nums in range(OBS_DIM - 1 )
]
i_optimal = np.tile(i_optimal, [n_test, 1])
for c in i_optimal: random.shuffle(c)  # shuffle each test point's feature ordering in place
## evaluate likelihood at initial stage (no observation)
negative_predictive_llh, uncertainty = vae.predictive_loss(
x, mask, M)
information_curve_RAND[r, :, 0] = negative_predictive_llh
for t in range(OBS_DIM - 1 ):
print("Repeat = {:.1f}".format(r))
print("Strategy = {:.1f}".format(strategy))
print("Step = {:.1f}".format(t))
io = np.eye(OBS_DIM)[i_optimal[:, t]]
mask = mask + io
negative_predictive_llh, uncertainty = vae.predictive_loss(
x, mask, M)
information_curve_RAND[r, :, t +
1] = negative_predictive_llh
elif strategy == 1:### single ordering strategy "SING"
#SING is obtained by maximizing the mean information reward per step over the test set, to be consistent with the description in the paper.
#We could also obtain this ordering using a subset of the training set and then apply it to the test set.
x = Data_test[:, :] #
x = np.reshape(x, [n_test, OBS_DIM])
mask = np.zeros((n_test, OBS_DIM)) # this stores the mask of missingness (stemming from both test data missingness and unselected features during active learning)
mask2 = np.zeros((n_test, OBS_DIM)) # this stores the mask indicating which features have been selected for each data point
mask[:, -1] = 0 # Note that no matter how you initialize mask, we always keep the target variable (last column) unobserved.
negative_predictive_llh, uncertainty = vae.predictive_loss(
x, mask, M)
information_curve_SING[r, :, 0] = negative_predictive_llh
for t in range(OBS_DIM - 1 ): # t indexes the selection step
print("Repeat = {:.1f}".format(r))
print("Strategy = {:.1f}".format(strategy))
print("Step = {:.1f}".format(t))
## note that for single ordering, there are two rewards.
# The first one (R) is calculated based on no observations.
# This is used for active learning phase, since single ordering should not depend on observations.
# The second one (R_eval) is calculated in the same way as chain rule approximation. This is only used for visualization.
R = -1e4 * np.ones(
(n_test, OBS_DIM - 1)
)
im_0 = completion(x, mask*0, M, vae) # sample from model prior
im = completion(x, mask, M, vae) # sample conditional on observations
im_SING[r, t, :, :, :] = im
for u in range(OBS_DIM - 1): # u indexes the candidate features; compute the reward for each candidate
loc = np.where(mask2[:, u] == 0)[0]
if estimation_method == 0:
R[loc, u] = R_lindley_chain(u, x, mask, M, vae, im_0,loc)
else:
R[loc, u] = R_lindley_chain(u, x, mask, M, vae, im, loc)
R_hist_SING[r, t, :, :] = R
i_optimal = (R.mean(axis=0)).argmax() # optimal decision based on reward averaged on all data
i_optimal = np.tile(i_optimal, [n_test])
io = np.eye(OBS_DIM)[i_optimal]
action_SING[r, :, t] = i_optimal
mask = mask + io*mask_test # this mask takes into account both data missingness and missingness of unselected features
negative_predictive_llh, uncertainty = vae.predictive_loss(
x, mask, M)
mask2 = mask2 + io # this mask only stores the missingness of unselected features, i.e., which features have been selected for each data point
information_curve_SING[r, :, t +
1] = negative_predictive_llh
elif strategy == 2: ### EDDI strategy (chain rule approximation)
# personalized active feature selection strategy
## create arrays to store data and missingness
x = Data_test[:, :] #
x = np.reshape(x, [n_test, OBS_DIM])
mask = np.zeros((n_test, OBS_DIM)) # this stores the mask of missingness (stemming from both test data missingness and unselected features during active learning)
mask2 = np.zeros((n_test,OBS_DIM)) # this stores the mask indicating which features have been selected for each data point
mask[:,-1] = 0 # Note that no matter how you initialize mask, we always keep the target variable (last column) unobserved.
## evaluate likelihood at initial stage (no observation)
negative_predictive_llh, uncertainty = vae.predictive_loss(
x, mask, M)
information_curve_CHAI[r, :, 0] = negative_predictive_llh
for t in range(OBS_DIM - 1): # t indexes the selection step
print("Repeat = {:.1f}".format(r))
print("Strategy = {:.1f}".format(strategy))
print("Step = {:.1f}".format(t))
R = -1e4 * np.ones((n_test, OBS_DIM - 1))
im = completion(x, mask, M, vae)
im_CHAI[r, t, :, :, :] = im
for u in range(OBS_DIM - 1): # u indexes the candidate features; compute the reward for each candidate
loc = np.where(mask[:, u] == 0)[0]
R[loc, u] = R_lindley_chain(u, x, mask, M, vae, im,
loc)
R_hist_CHAI[r, t, :, :] = R
i_optimal = R.argmax(axis=1)
io = np.eye(OBS_DIM)[i_optimal]
action_CHAI[r, :, t] = i_optimal
mask = mask + io # this mask takes into account both data missingness and missingness of unselected features
negative_predictive_llh, uncertainty = vae.predictive_loss(
x, mask, M)
mask2 = mask2 + io # this mask only stores the missingness of unselected features, i.e., which features have been selected for each data point
print(mask2[0:5, :])
information_curve_CHAI[r, :, t +
1] = negative_predictive_llh
# Save results
np.savez(
os.path.join(
args.output_dir,
'UCI_information_curve_RAND.npz'),
information_curve=information_curve_RAND)
np.savez(
os.path.join(
args.output_dir,
'UCI_information_curve_SING.npz'),
information_curve=information_curve_SING)
np.savez(
os.path.join(
args.output_dir,
'UCI_information_curve_CHAI.npz'),
information_curve=information_curve_CHAI)
np.savez(
os.path.join(args.output_dir,
'UCI_action_SING.npz'),
action=action_SING)
np.savez(
os.path.join(args.output_dir,
'UCI_action_CHAI.npz'),
action=action_CHAI)
np.savez(
os.path.join(args.output_dir,
'UCI_R_hist_SING.npz'),
R_hist=R_hist_SING)
np.savez(
os.path.join(args.output_dir,
'UCI_R_hist_CHAI.npz'),
R_hist=R_hist_CHAI)
np.savez(
os.path.join(args.output_dir,
'UCI_im_SING.npz'),
im=im_SING)
np.savez(
os.path.join(args.output_dir,
'UCI_im_CHAI.npz'),
im=im_CHAI)
return None
def train_p_vae(Data_train,mask_train, epochs, latent_dim,batch_size, p, K,iteration):
'''
This function trains the partial VAE.
:param Data_train: training Data matrix, N by D
:param mask_train: mask matrix that indicates the missingness. 1=observed, 0 = missing
:param epochs: number of epochs of training
:param latent_dim: latent dimension for partial VAE model
:param p: dropout rate for creating additional missingness during training
:param K: dimension of feature map of PNP encoder
:param iteration: how many mini-batches are used each epoch. set to -1 to run the full epoch.
:return: the trained VAE.
'''
obs_dim = Data_train.shape[1]
n_train = Data_train.shape[0]
list_train = np.arange(n_train)
####### construct
kwargs = {
'K': K,
'obs_distrib': "Gaussian",
'latent_dim': latent_dim,
'batch_size': batch_size,
'encoder': PNP_fc_uci_encoder,
'decoder': fc_uci_decoder,
'obs_dim': obs_dim,
'load_model':0,
'decoder_path': FINETUNED_DECODER_WEIGHTS,
'encoder_path': ENCODER_WEIGHTS,
}
vae = PN_Plus_VAE(**kwargs)
if iteration == -1:
n_it = int(np.ceil(n_train / kwargs['batch_size']))
else:
n_it = iteration
for epoch in range(epochs):
training_loss_full = 0.
# test_loss, test_kl, test_recon = vae.full_batch_loss(Data_test,mask_test)
# test_loss = test_loss
# test_kl = test_kl / n_test
# test_recon = test_recon / n_test
# iterate through batches
# np.random.shuffle(list_train)
for it in range(n_it):
if iteration == -1:
batch_indices = list_train[it * kwargs['batch_size'] : (it + 1) * kwargs['batch_size']]
else:
batch_indices = sample(range(n_train), kwargs['batch_size'])
x = Data_train[batch_indices, :]
mask_train_batch = mask_train[batch_indices, :]
# DROPOUT_TRAIN = np.minimum(np.random.rand(1), p)
# while True:
# mask_drop = np.array([bernoulli.rvs(1 - DROPOUT_TRAIN, size=obs_dim)] *
# kwargs['batch_size'])
# if np.sum(mask_drop>0):
# break
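# Fix for the random-mask problem: the commented-out block above drew a single dropout probability and
# repeated the same drop pattern for every row of the mini-batch; below, every entry gets its own
# dropout probability (capped at p) and an independent Bernoulli draw.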
DROPOUT_TRAIN = np.minimum(np.random.rand(x.shape[0], obs_dim), p)  # per-entry dropout probabilities for the actual mini-batch size
while True:
# mask_drop = np.array([bernoulli.rvs(1 - DROPOUT_TRAIN)] )
mask_drop = bernoulli.rvs(1 - DROPOUT_TRAIN)
if np.sum(mask_drop > 0):
break
mask_drop = mask_drop.reshape([x.shape[0], obs_dim])
_ = vae.update(x, mask_drop*mask_train_batch)
loss_full, _, _ = vae.full_batch_loss(x,mask_drop*mask_train_batch)
training_loss_full += loss_full
# average loss over most recent epoch
training_loss_full /= n_it
print(
'Epoch: {} \tnegative training ELBO per observed feature: {:.2f}'
.format(epoch, training_loss_full))
vae.save_generator(FINETUNED_DECODER_WEIGHTS)
vae.save_encoder(ENCODER_WEIGHTS)
return vae

172
boston_bar_plot.py Normal file

@ -0,0 +1,172 @@
'''
- This file demonstrates the visualization of active learning process of Boston Housing.
Please first run the active learning in main_active_learning.py
Then, to run this file:
python boston_bar_plot.py --data_dir your_directory/data/boston/ --output_dir your_directory/model/
- Each figure named in the format "_SING_avg_bars_x.png" (where x is the step of active learning) shows the distribution
of information reward function of each feature at this specific step on the y-axis.
All unobserved variables start with green bars, and turns purple once selected by the algorithm. . Step 0 corresponds to no features has been selected.
All reward will be updated after each decision of feature selection, and the reward of selected features will
will be fixed.
- Each figure named in the format "_SING_avg_violins_x.png" (where x is the step of active learning) shows the
violin plot of the posterior density estimates of the remaining unobserved variables at that specific step
'''
import pandas as pd
import numpy as np
import argparse
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import sklearn.preprocessing as preprocessing
parser = argparse.ArgumentParser(
description='EDDI')
parser.add_argument(
'--output_dir',
type=str,
default=os.getenv('PT_OUTPUT_DIR', '/tmp'))
parser.add_argument(
'--data_dir',
type=str,
default=os.getenv('PT_DATA_DIR', 'data'),
help='Directory where UCI dataset is stored.')
args = parser.parse_args()
#### load data
data_grid = np.array([0,1,2,3,4,5,6,9,10,12])
data = 1
rs = 42
UCI = args.data_dir
Data = pd.read_excel(UCI + 'd0.xls')
Data = Data.as_matrix()
OBS_DIM = Data.shape[1]
### UCI data preprocess
if data_grid[data-1] == 12:
size_feature = Data.shape[1]
n_data = Data.shape[0]
else:
max_Data = 1 #
min_Data = 0 #
Data_std = (Data - Data.min(axis=0)) / (Data.max(axis=0) - Data.min(axis=0))
Data = Data_std * (max_Data - min_Data) + min_Data
size_feature = Data.shape[1]
n_data = Data.shape[0]
Data_train, Data_test = train_test_split(
Data, test_size=0.1, random_state=rs)
UCI_SING = args.output_dir
npzfile = np.load(UCI_SING+'UCI_R_hist_SING.npz')
R_SING=npzfile['R_hist']
R_SING = np.maximum(R_SING,0)
npzfile = np.load(UCI_SING+'UCI_action_SING.npz')
A_SING=npzfile['action']
npzfile = np.load(UCI_SING+'UCI_im_SING.npz')
im_SING=npzfile['im']
K = 3
T=10
repeat = 0
### SING strategy over all data
pos = np.arange(R_SING.shape[1])+1
A_SING_0_t = []
A_SING_t_minus_1 = []
R_SING_t_minus_1 = []
bars1 = []
r1 = []
r2 = pos
for t in range(T):
# Create bars T=1
A_SING_t = A_SING[repeat,K,t]
R_SING_t = R_SING[repeat,t,:,:].mean(axis=0)*100
barWidth = 0.9
bars2 = R_SING_t
if t >0:
bars1 = np.append(bars1, R_SING_t_minus_1[int(A_SING_t_minus_1)])
bars2 = np.delete(bars2,A_SING_0_t.astype(int))
# The X position of bars
if t>0:
r1 = np.append(r1,int(A_SING_t_minus_1)+1)
r2 = np.delete(r2,np.argwhere(r2==int(A_SING_t_minus_1)+1))
# Create barplot
if t>0:
plt.bar(r1, bars1, width = barWidth, color = (0.3,0.1,0.4,0.6), label='Selected')
plt.bar(r2, bars2, width = barWidth, color = (0.3,0.9,0.4,0.6), label='Unselected')
# Note: the barplot could be created easily. See the barplot section for other examples.
# Create legend
plt.legend()
# Text below each barplot with a rotation at 90°
#plt.xticks([r + barWidth for r in range(R_SING.shape[1])], ['crime rate','proportion of residential land zoned','proportion of non-retail business acres',' Charles River',' nitric oxides concentration','average number of rooms','owner-occupied units built prior to 1940','distances to Boston centres','accessibility to radial highways','property-tax rate',' pupil-teacher ratio','proportion of blacks','lower status of the population'], rotation=90)
plt.xticks([r + barWidth for r in range(R_SING.shape[1])], ['CR','PRD','PNB',' CHR',' NOC','ANR','OUB','DTB','ARH','TAX',' OTR','PB','LSP',])
# Create labels
# Adjust the margins
#plt.subplots_adjust(bottom= 0.3, top = 0.98)
# Show graphic
plt.savefig(UCI_SING+'_SING_avg_bars_'+str(t)+'.png', format='png', dpi=200)
plt.show()
A_SING_t_minus_1 = A_SING_t
R_SING_t_minus_1 = R_SING_t
A_SING_0_t = np.append(A_SING_0_t,A_SING_t)
## plot violin plot
im = im_SING[repeat,t,:,K,:]
target = Data_test[K,:]
if t >0:
im[:,(r1-1).astype(int)] = target[(r1-1).astype(int)]
M = im.shape[0]
obs_dim = im.shape[1]
GR_name = np.array(['CR','PRD','PNB',' CHR',' NOC','ANR','OUB','DTB','ARH','TAX',' OTR','PB','LSP','PRC'])
GR_label_im = np.array([])
GR_label_target = np.array([])
IM = np.reshape(im.T,(M*obs_dim,))
for i in range(M*obs_dim):
GR_label_im = np.append(GR_label_im, GR_name[int(np.floor(i/M))])
df_im = pd.DataFrame(dict(Score = IM, Group = GR_label_im))
for i in range(obs_dim):
GR_label_target = np.append(GR_label_target, GR_name[int(i)])
df_target = pd.DataFrame(dict(Score = target, Group = GR_label_target))
# library & dataset
import seaborn as sns
# Use a color palette
plt.legend()
ax1 = sns.violinplot( x=df_im["Group"], y=df_im["Score"],palette="Blues",scale="count",bw=0.5)
ax1.set_ylabel('')
ax2 = sns.pointplot(x=df_target["Group"], y=df_target["Score"], join = False,markers = 'x',scatter_kws={"s": 0.1},color = 'black')
ax2.set_ylabel('')
plt.savefig(UCI_SING+'SING_avg_violins_'+str(t)+'.png', format='png', dpi=200)
plt.show()

28
codings.py Normal file

@ -0,0 +1,28 @@
import tensorflow as tf
from tensorflow.contrib import layers
from tensorflow.contrib.framework import arg_scope
def fc_uci_decoder(z, obs_dim, activation=tf.nn.sigmoid): # only output means, since the model is N(m, sigma*I) or Bernoulli(m)
x = layers.fully_connected(z, 50, scope='fc-01')
x = layers.fully_connected(x, 100, scope='fc-02')
x = layers.fully_connected(x, obs_dim, activation_fn=tf.nn.sigmoid,
scope='fc-final')
return x, None
def fc_uci_encoder(x, latent_dim, activation=None):
e = layers.fully_connected(x, 100, scope='fc-01')
e = layers.fully_connected(e, 50, scope='fc-02')
e = layers.fully_connected(e, 2 * latent_dim, activation_fn=activation,
scope='fc-final')
return e
def PNP_fc_uci_encoder(x, K, activation=None):
e = layers.fully_connected(x, 100, scope='fc-01')
e = layers.fully_connected(e, 50, scope='fc-02')
e = layers.fully_connected(e, K, scope='fc-final')
return e

113
main_active_learning.py Normal file

@ -0,0 +1,113 @@
'''
EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE
This code implements the EDDI active learning strategy, together with a global single ordering strategy
based on partial VAE (PNP), demonstrated on a UCI dataset.
To run this code:
python main_active_learning.py --epochs 3000 --latent_dim 10 --p 0.99 --data_dir your_directory/data/boston --output_dir your_directory/model
possible arguments:
- epochs: number of epochs.
- latent_dim: size of latent space of partial VAE.
- p: upper bound for the artificial missingness probability. For example, if set to 0.9, then during each training epoch, the algorithm will
randomly choose a probability smaller than 0.9 and randomly drop observations according to this probability.
Our suggestion is that if the original dataset already contains missing data, you can just set p to 0.
- batch_size: mini-batch size for training. default: 100
- iteration: iterations (number of mini-batches) used per epoch. Set to -1 to run the full epoch.
If your dataset is large, set this to a smaller value such as 10.
- K: dimension of the feature map (h) of the PNP encoder. Default: 20
- M: number of MC samples used when performing imputation. Default: 50
- repeat: Number of repeats of the active learning experiment
- data_dir: Directory where UCI dataset is stored.
- output_dir: Directory where the trained model will be stored and loaded.
Other comments:
- We assume that the data is stored in an Excel file named d0.xls,
and that the last column is the target variable of interest (only used in active learning).
You may need to modify the load-data section according to your task.
- Note that this code assumes real-valued data with Gaussian noise. You may need to modify the likelihood function for other types of data.
- In preprocessing, we chose to squash the data to the range of 0 and 1. Therefore our decoder output has also been squashed
by a sigmoid function. If you wish to change the preprocessing, you may also need to change the decoder accordingly.
This can be found in codings.py.
File Structure:
- main functions:
main_train_and_impute.py: implements the training of the Partial VAE (PNP) part, demonstrated on a UCI dataset.
main_active_learning.py: implements the EDDI strategy, together with a global single ordering strategy based on the Partial VAE, demonstrated on a UCI dataset;
it will also generate an information curve plot.
- decoder-encoder functions: codings.py
- partial VAE class:p_vae.py
- training-impute functions: train_and_test_functions.py
- training-active learning functions:active_learning_functions.py
- active learning visualization: boston_bar_plot.py; this will visualize the decision process of EDDI on the Boston Housing data.
- data: data/boston/d0.xls
'''
### load models and functions
from active_learning_functions import *
#### Import libraries
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import pandas as pd
import sklearn.preprocessing as preprocessing
plt.switch_backend('agg')
tfd = tf.contrib.distributions
### load data
Data = pd.read_excel(UCI + '/d0.xls')
Data = Data.as_matrix()
### data preprocess
max_Data = 1 #
min_Data = 0 #
Data_std = (Data - Data.min(axis=0)) / (Data.max(axis=0) - Data.min(axis=0))
Data = Data_std * (max_Data - min_Data) + min_Data
Mask = np.ones(Data.shape) # this UCI data is fully observed
Data_train, Data_test, mask_train, mask_test = train_test_split(
Data, Mask, test_size=0.1, random_state=rs)
### run training and active learning
# #number of experiments that you want to repeat
Repeat = args.repeat
#Training Partial VAE and then apply Random order feature selection (RAND in the paper) and single order feature selection (SING in the paper)
#generate information curve and per step information gain with RAND and SING.
p_vae_active_learning(Data_train,mask_train,Data_test,mask_test,args.epochs,args.latent_dim,args.batch_size,args.p,args.K,args.M,Repeat)
### visualize active learning
npzfile = np.load(args.output_dir+'/UCI_information_curve_RAND.npz')
IC_RAND=npzfile['information_curve']
npzfile = np.load(args.output_dir+'/UCI_information_curve_SING.npz')
IC_SING=npzfile['information_curve']
npzfile = np.load(args.output_dir+'/UCI_information_curve_CHAI.npz')
IC_CHAI=npzfile['information_curve']
import matplotlib.pyplot as plt
plt.figure(0)
L = IC_SING.shape[1]
fig, ax1 = plt.subplots()
# These are in unitless percentages of the figure size. (0,0 is bottom left)
left, bottom, width, height = [0.45, 0.4, 0.45, 0.45]
ax1.plot((IC_RAND[:,:,0:].mean(axis=1)).mean(axis = 0),'gs',linestyle = '-.', label = 'PNP+RAND')
ax1.errorbar(np.arange(IC_RAND.shape[2]), (IC_RAND[:,:,0:].mean(axis=1)).mean(axis = 0), yerr=(IC_RAND[:,:,0:].mean(axis=1)).std(axis = 0)/np.sqrt(IC_SING.shape[0]),ecolor='g',fmt = 'gs')
ax1.plot((IC_SING[:,:,0:].mean(axis=1)).mean(axis = 0),'ms',linestyle = '-.', label = 'PNP+SING')
ax1.errorbar(np.arange(IC_SING.shape[2]), (IC_SING[:,:L,0:].mean(axis=1)).mean(axis = 0), yerr=(IC_SING[:,0:L,0:].mean(axis=1)).std(axis = 0)/np.sqrt(IC_SING.shape[0]),ecolor='m',fmt = 'ms')
ax1.plot((IC_CHAI[:,:,0:].mean(axis=1)).mean(axis = 0),'ks',linestyle = '-.', label = 'PNP+PERSONAL')
ax1.errorbar(np.arange(IC_CHAI.shape[2]), (IC_CHAI[:,:L,0:].mean(axis=1)).mean(axis = 0), yerr=(IC_CHAI[:,0:L,0:].mean(axis=1)).std(axis = 0)/np.sqrt(IC_CHAI.shape[0]),ecolor='k',fmt = 'ks')
plt.xlabel('Steps',fontsize=18)
plt.ylabel('avg. neg. test likelihood',fontsize=18)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
ax1.legend(bbox_to_anchor=(0.0, 1.02, 1., .102), mode = "expand", loc=3,
ncol=1, borderaxespad=0.,prop={'size': 20}, frameon=False)
plt.show()
plt.savefig(args.output_dir+'/PNP_all_IC_curves.eps', format='eps', dpi=200,bbox_inches='tight')

96
main_train_and_impute.py Normal file

@ -0,0 +1,96 @@
'''
EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE
This code implements the Partial VAE (PNP) part, demonstrated on a UCI dataset.
To run this code:
python main_train_and_impute.py --epochs 3000 --latent_dim 10 --p 0.99 --data_dir your_directory/data/boston --output_dir your_directory/model
possible arguments:
- epochs: number of epochs.
- latent_dim: size of latent space of partial VAE.
- p: upper bound for the artificial missingness probability. For example, if set to 0.9, then during each training epoch, the algorithm will
randomly choose a probability smaller than 0.9 and randomly drop observations according to this probability.
Our suggestion is that if the original dataset already contains missing data, you can just set p to 0.
- batch_size: mini-batch size for training. default: 100
- iteration: iterations (number of mini-batches) used per epoch. Set to -1 to run the full epoch.
If your dataset is large, set this to a smaller value such as 10.
- K: dimension of the feature map (h) of the PNP encoder. Default: 20
- M: number of MC samples used when performing imputation. Default: 50
- data_dir: Directory where UCI dataset is stored.
- output_dir: Directory where the trained model will be stored and loaded.
Other comments:
- We assume that the data is stored in an Excel file named d0.xls,
and that the last column is the target variable of interest (only used in active learning).
You should modify the load-data section according to your data.
- Note that this code assumes real-valued data with Gaussian noise. You may need to modify the likelihood function for other types of data.
- In preprocessing, we chose to squash the data to the range of 0 and 1. Therefore our decoder output has also been squashed
by a sigmoid function. If you wish to change the preprocessing, you may also need to change the decoder accordingly.
This can be found in codings.py.
File Structure:
- main functions:
main_train_and_impute.py: implements the training of the Partial VAE (PNP) part, demonstrated on a UCI dataset.
main_active_learning.py: implements the EDDI active learning strategy, together with a global single ordering strategy based on the Partial VAE, demonstrated on a UCI dataset;
it will also generate an information curve plot.
- decoder-encoder functions: codings.py
- partial VAE class:p_vae.py
- training-impute functions: train_and_test_functions.py
- training-active learning functions:active_learning_functions.py
- active learning visualization: boston_bar_plot.py; this will visualize the decision process of EDDI on the Boston Housing data.
- data: data/boston/d0.xls
'''
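# Note on other likelihoods (an illustrative pointer, not exercised by this script): the PN_Plus_VAE class in
# p_vae.py also supports obs_distrib="Bernoulli" for binary data; to use it, change the
# 'obs_distrib': "Gaussian" entry of the kwargs dict built inside train_p_vae
# (train_and_test_functions.py) to "Bernoulli".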
### load models and functions
from train_and_test_functions import *
#### Import libraries
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import random
from random import sample
import pandas as pd
import sklearn.preprocessing as preprocessing
plt.switch_backend('agg')
tfd = tf.contrib.distributions
### load data
Data = pd.read_excel(UCI + '/d0.xls')
Data = Data.as_matrix()
### data preprocess
max_Data = 1 #
min_Data = 0 #
Data_std = (Data - Data.min(axis=0)) / (Data.max(axis=0) - Data.min(axis=0))
Data = Data_std * (max_Data - min_Data) + min_Data
Mask = np.ones(Data.shape) # This is a mask indicating missingness, 1 = observed, 0 = missing.
# this UCI data is fully observed. you should modify the set up of Mask if your data contains missing data.
### split the data into train and test sets
Data_train, Data_test, mask_train, mask_test = train_test_split(
Data, Mask, test_size=0.1, random_state=rs)
### Train the model and save the trained model.
vae = train_p_vae(Data_train,mask_train, args.epochs, args.latent_dim, args.batch_size,args.p, args.K,args.iteration)
### Test the imputation model on the test set
## Calculate test ELBO of observed test data (will load the pre-trained model). Note that this is NOT imputing.
tf.reset_default_graph()
test_loss = test_p_vae_marginal_elbo(Data_test,mask_test, args.latent_dim,args.K)
## Calculate imputation RMSE (conditioned on observed data. will load the pre-trained model)
## Note that here we perform imputation on a new dataset, whose observed entries were not used in training.
## This will underestimate the imputation performance, since in principle all observed entries should be used to train the model.
tf.reset_default_graph()
Data_ground_truth = Data_test
mask_obs = np.array([bernoulli.rvs(1 - 0.3, size=Data_ground_truth.shape[1]*Data_ground_truth.shape[0])]) # manually create missing data on test set
mask_obs = mask_obs.reshape(Data_ground_truth.shape)
Data_observed = Data_ground_truth*mask_obs
mask_target = 1-mask_obs
# The line below is optional. Uncommenting it means that we use the newly arriving test set to continue updating the imputation model; leaving it commented means we only use the pre-trained model to impute, without further updating the model.
# vae = train_p_vae(Data_ground_truth,mask_obs, args.epochs, args.latent_dim, args.batch_size,0, args.K,args.iteration)
tf.reset_default_graph()
RMSE = impute_p_vae(Data_observed,mask_obs,Data_ground_truth,mask_target,args.latent_dim,args.batch_size,args.K,args.M)

519
p_vae.py Normal file

@ -0,0 +1,519 @@
import tensorflow as tf
import numpy as np
from tensorflow.contrib import layers
from tensorflow.contrib.distributions import Normal
import re
import copy
class PN_Plus_VAE(object):
def __init__(self,
encoder,
decoder,
obs_dim,
decoder_path,
encoder_path,
learning_rate=1e-3,
optimizer=tf.train.AdamOptimizer,
obs_distrib="Gaussian",
obs_std=0.1*np.sqrt(2),
K = 20,
latent_dim = 10,
batch_size = 100,
load_model=0,
M=5,
all=1):
'''
:param encoder: type of encoder model chosen from codings.py
:param decoder: type of decoder model chosen from codings.py
:param obs_dim: maximum number of partial observational dimensions
:param encoder_path: path for saving encoder model parameter
:param decoder_path: path for saving decoder model parameter
:param learning_rate: optimizer learning rate
:param optimizer: we use Adam here.
:param obs_distrib: Bernoulli or Gaussian.
:param obs_std: observational noise for decoder.
:param K: length of code for summarizing partial observations
:param latent_dim: latent dimension of VAE
:param batch_size: training batch size
:param load_model: 1 = load a pre-trained model from decoder_path and encoder_path
:param M : number of MC samples used when performing imputing/prediction
'''
self._K = K
self._latent_dim = latent_dim
self._batch_size = batch_size
self._encode = encoder
self._decode = decoder
self._obs_dim = obs_dim
self._learning_rate = learning_rate
self._optimizer = optimizer
self._obs_distrib = obs_distrib
self._obs_std = obs_std
self._load_model = load_model
self._all = all
self._decoder_path = decoder_path
self._encoder_path = encoder_path
self._M = M
self._build_graph()
## build partial VAE
def _build_graph(self):
with tf.variable_scope('is'):
# placeholder for UCI inputs
self.x = tf.placeholder(tf.float32, shape=[None, self._obs_dim])
self.x_flat = tf.reshape(self.x, [-1, 1])
# placeholder for masks
self.mask = tf.placeholder(tf.float32, shape=[None, self._obs_dim])
self._batch_size = tf.shape(self.x)[0]
# encode inputs (map to parameterization of diagonal Gaussian)
with tf.variable_scope('encoder'):
# the tensor F stores the learnable per-feature ID (identity) embeddings
self.F = tf.get_variable(
"F",
shape=[1, self._obs_dim, 10],
initializer=tf.contrib.layers.xavier_initializer())
self.F = tf.tile(self.F, [self._batch_size, 1, 1])
self.F = tf.reshape(self.F, [-1, 10])
self.b = tf.get_variable(
"b",
shape=[1, self._obs_dim, 1],
initializer=tf.contrib.layers.xavier_initializer())
# bias vector
self.b = tf.tile(self.b, [self._batch_size, 1, 1])
self.b = tf.reshape(self.b, [-1, 1])
self.x_aug = tf.concat(
[self.x_flat, self.x_flat * self.F, self.b], 1)
self.encoded = layers.fully_connected(self.x_aug, self._K)
self.encoded = tf.reshape(self.encoded,
[-1, self._obs_dim, self._K])
self.mask_on_hidden = tf.reshape(self.mask,
[-1, self._obs_dim, 1])
self.mask_on_hidden = tf.tile(self.mask_on_hidden,
[1, 1, self._K])
self.encoded = tf.nn.relu(
tf.reduce_sum(self.encoded * self.mask_on_hidden, 1))
self.encoded = layers.fully_connected(self.encoded, 500)
self.encoded = layers.fully_connected(self.encoded, 200)
self.encoded = layers.fully_connected(
self.encoded, 2 * self._latent_dim, activation_fn=None)
with tf.variable_scope('sampling'):
# unpacking mean and (diagonal) variance of latent variable
self.mean = self.encoded[:, :self._latent_dim]
self.logvar = self.encoded[:, self._latent_dim:]
# also calculate standard deviation for practical use
self.stddev = tf.sqrt(tf.exp(self.logvar))
# sample from latent space
epsilon = tf.random_normal(
[self._batch_size, self._latent_dim])
self.z = self.mean + self.stddev * epsilon
# decode batch
with tf.variable_scope('generator'):
self.decoded, _ = self._decode(self.z, self._obs_dim)
with tf.variable_scope('loss'):
# KL divergence between approximate posterior q and prior p
with tf.variable_scope('kl-divergence'):
self.kl = self._kl_diagnormal_stdnormal(
self.mean, self.logvar)
# loss likelihood
if self._obs_distrib == 'Bernoulli':
with tf.variable_scope('bernoulli'):
self.log_like = self._bernoulli_log_likelihood(
self.x, self.decoded, self.mask)
else:
with tf.variable_scope('gaussian'):
self.log_like = self._gaussian_log_likelihood(
self.x * self.mask, self.decoded * self.mask,
self._obs_std)
self._loss = (self.kl + self.log_like) / tf.cast(
self._batch_size, tf.float32) # loss per instance (actual loss used)
self._loss_print = (self.kl + self.log_like) / tf.reduce_sum(
self.mask) # loss per feature, for tracking training process only
with tf.variable_scope('optimizer'):
optimizer = self._optimizer(learning_rate=self._learning_rate)
with tf.variable_scope('training-step'):
self._train = optimizer.minimize(self._loss)
if self._load_model == 1:
generator_variables = []
for v in tf.trainable_variables():
if "generator" in v.name:
generator_variables.append(v)
encoder_variables = []
for v in tf.trainable_variables():
if "encoder" in v.name:
encoder_variables.append(v)
# start tensorflow session
self._sesh = tf.Session()
load_encoder = tf.contrib.framework.assign_from_checkpoint_fn(
self._encoder_path, encoder_variables)
load_encoder(self._sesh)
load_generator = tf.contrib.framework.assign_from_checkpoint_fn(
self._decoder_path, generator_variables)
load_generator(self._sesh)
uninitialized_vars = []
for var in tf.all_variables():
try:
self._sesh.run(var)
except tf.errors.FailedPreconditionError:
uninitialized_vars.append(var)
init_new_vars_op = tf.variables_initializer(uninitialized_vars)
self._sesh.run(init_new_vars_op)
else:
self._sesh = tf.Session()
init = tf.global_variables_initializer()
self._sesh.run(init)
## KL divergence
def _kl_diagnormal_stdnormal(self, mu, log_var):
'''
This function calculates KL divergence
:param mu: mean
:param log_var: log variance
:return:
'''
var = tf.exp(log_var)
kl = 0.5 * tf.reduce_sum(tf.square(mu) + var - 1. - log_var)
return kl
## likelihood terms
@staticmethod
def _bernoulli_log_likelihood(targets, outputs, mask, eps=1e-8):
'''
This function computes the negative log likelihood for a Bernoulli likelihood
:param targets: test data
:param outputs: model predictions
:param mask: mask of missingness
:return: negative log llh
'''
eps = 0
log_like = -tf.reduce_sum(targets * (tf.log(outputs + eps) * mask) +
(1. - targets) *
(tf.log((1. - outputs) + eps) * mask))
return log_like
@staticmethod
def _gaussian_log_likelihood(targets, mean, std):
'''
This function computes negative log likelihood for Gaussians during training
Note that constant terms are dropped.
:param targets: test data
:param mean: mean
:param std: sigma
:return: negative log llh
'''
se = tf.reduce_sum(
0.5 * tf.square(targets - mean) / tf.cast(tf.square(std), tf.float32) + tf.cast(tf.log(std), tf.float32))
return se
## optimization function
def update(self, x, mask):
'''
This function is used to update the model during training/optimization
:param x: training data
:param mask: mask that indicates observed data and missing data locations
'''
_, loss = self._sesh.run([self._train, self._loss_print],
feed_dict={
self.x: x,
self.mask: mask
})
return loss
##
def full_batch_loss(self, x,mask):
'''
retrieve different components of loss function
:param x: data matrix
:param mask: mask that indicates observed data and missing data locations
:return: overall loss (averaged over all entries), KL term, and reconstruction loss
'''
# mask = np.ones((x.shape[0], self._obs_dim))
loss, kl, recon = self._sesh.run(
[self._loss_print, self.kl, self.log_like],
feed_dict={
self.x: x,
self.mask: mask
})
return loss, kl, recon
## predictive likelihood and uncertainties
def predictive_loss(self, x, mask, M):
'''
This function computes predictive losses (negative llh).
This is used for active learning phase.
We assume that the last column of x is the target variable of interest
:param x: data matrix, the last column of x is the target variable of interest
:param mask: mask that indicates observed data and missing data locations
:param M: number of MC samples
:return: negative predictive log likelihood of the target (last column), and the predictive uncertainty (std of the MC samples)
'''
llh = 0
lh = 0
uncertainty_data = np.zeros((x.shape[0], M))
for m in range(M):
decoded = self._sesh.run(
self.decoded, feed_dict={
self.x: x,
self.mask: mask
})
target = x[:, -1]
output = decoded[:, -1]
uncertainty_data[:, m] = decoded[:, -1]
if self._obs_distrib == 'Bernoulli':
llh += target * (np.log(output + 1e-8)) + (1. - target) * (
np.log((1. - output) + 1e-8))
else:
lh += np.exp(-0.5 * np.square(target - output) / (
np.square(self._obs_std)) - np.log(self._obs_std) - 0.5 * np.log(2 * np.pi))
uncertainty = uncertainty_data.std(axis=1)
if self._obs_distrib == 'Bernoulli':
nllh = -llh / M
else:
nllh = -np.log(lh / M)
return nllh, uncertainty
def impute_losses(self, x, mask_obs, mask_target):
'''
This function computes imputation losses
:param x: data matrix
:param mask_obs: mask that indicates observed data and missing data locations
:param mask_target: mask that indicates the test data locations
:return: MAE and RMSE
'''
MAE = 0
RMSE = 0
for m in range(self._M):
decoded = self._sesh.run(self.decoded,
feed_dict={self.x: x, self.mask: mask_obs})
target = x * mask_target
output = decoded * mask_target
MAE += np.sum(np.abs(target - output)) / np.sum(mask_target)
RMSE += np.sqrt(np.sum(np.square(target - output)) / np.sum(mask_target))
MAE = MAE / self._M
RMSE = RMSE / self._M
return MAE, RMSE
## generate partial inference samples
def im(self, x, mask):
'''
This function produces simulations of unobserved variables based on observed ones.
:param x: data matrix
:param mask: mask that indicates observed data and missing data locations
:return: im, which contains samples of completion.
'''
m, v = self._sesh.run([self.mean, self.stddev],
feed_dict={
self.x: x,
self.mask: mask
})
ep = np.random.normal(0, 1, [x.shape[0], self._latent_dim])
z = m + v * ep
im = self._sesh.run(self.decoded, feed_dict={self.z: z})
return im
## calculate the first term of information reward approximation
def chaini_I(self, x, mask, i):
'''
calculate the first term of information reward approximation
used only in active learning phase
:param x: data
:param mask: mask of missingness
:param i: indicates the index of x_i
:return: the first term of information reward approximation
'''
temp_mask = copy.deepcopy(mask)
m, v = self._sesh.run([self.mean, self.stddev],
feed_dict={
self.x: x,
self.mask: temp_mask
})
var = v**2
log_var = 2 * np.log(v)
temp_mask[:, i] = 1
m_i, v_i = self._sesh.run([self.mean, self.stddev],
feed_dict={
self.x: x,
self.mask: temp_mask
})
var_i = v_i**2
log_var_i = 2 * np.log(v_i)
kl_i = 0.5 * np.sum(
np.square(m_i - m) / v + var_i / var - 1. - log_var_i + log_var,
axis=1)
return kl_i
## calculate the second term of information reward approximation
def chaini_II(self, x, mask, i):
'''
calculate the second term of information reward approximation
used only in active learning phase
Note that we assume that the last column of x is the target variable of interest
:param x: data
:param mask: mask of missingness
:param i: indicates the index of x_i
:return: the second term of information reward approximation
'''
# mask: represents x_o
# targets: 0 by M vector, contains M samples from p(\phi|x_o)
# x : 1 by obs_dim vector, contains 1 instance of data
# i: indicates the index of x_i
temp_mask = copy.deepcopy(mask)
temp_mask[:, -1] = 1
m, v = self._sesh.run([self.mean, self.stddev],
feed_dict={
self.x: x,
self.mask: temp_mask
})
var = v**2
log_var = 2 * np.log(v)
temp_mask[:, i] = 1
m_i, v_i = self._sesh.run([self.mean, self.stddev],
feed_dict={
self.x: x,
self.mask: temp_mask
})
var_i = v_i**2
log_var_i = 2 * np.log(v_i)
kl_i = 0.5 * np.sum(
np.square(m_i - m) / v + var_i / var - 1. - log_var_i + log_var,
axis=1)
return kl_i
## save model
def save_generator(self, path, prefix="is/generator"):
'''
This function saves generator parameters to path
'''
variables = tf.trainable_variables()
var_dict = {}
for v in variables:
if "generator" in v.name:
name = prefix + re.sub("is/generator", "", v.name)
name = re.sub(":0", "", name)
var_dict[name] = v
for k, v in var_dict.items():
print(k)
print(v)
saver = tf.train.Saver(var_dict)
saver.save(self._sesh, path)
def save_encoder(self, path, prefix="is/encoder"):
'''
This function saves encoder parameters to path
'''
variables = tf.trainable_variables()
var_dict = {}
for v in variables:
if "encoder" in v.name:
name = prefix + re.sub("is/encoder", "", v.name)
name = re.sub(":0", "", name)
var_dict[name] = v
for k, v in var_dict.items():
print(k)
print(v)
saver = tf.train.Saver(var_dict)
saver.save(self._sesh, path)
### function to generate new samples conditioned on observations
def completion(x, mask, M, vae):
'''
function to generate new samples conditioned on observations
:param x: underlying partially observed data
:param mask: mask of missingness
:param M: number of MC samples
:param vae: a pre-trained vae
:return: sampled missing data, an M by N by D matrix, where M is the number of samples.
'''
im = np.zeros((M, x.shape[0], x.shape[1]))
for m in range(M):
#tf.reset_default_graph()
np.random.seed(42 + m) ### added for bar plots only
im[m, :, :] = vae.im(x, mask)
return im
### function for computing reward function approximation
def R_lindley_chain(i, x, mask, M, vae, im, loc):
'''
function for computing reward function approximation
:param i: indicates the index of x_i
:param x: data matrix
:param mask: mask of missingness
:param M: number of MC samples
:param vae: a pre-trained vae
:param im: sampled missing data, an M by N by D matrix, where M is the number of samples.
:param loc: indices of the data points (rows) for which the reward is evaluated
:return: approximate information reward for feature i, evaluated at the rows in loc
'''
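# As implemented below, the information reward of feature i is approximated, for each imputed sample m, by
#   KL( q(z | x_o, x_i^(m))          || q(z | x_o) )            (chaini_I)
# - KL( q(z | x_o, x_i^(m), phi^(m)) || q(z | x_o, phi^(m)) )   (chaini_II)
# where phi denotes the target variable (last column); the result is averaged over the M samples.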
im_i = im[:, :, i]
#print(im_i)
approx_KL = 0
im_target = im[:, :, -1]
temp_x = copy.deepcopy(x)
for m in range(M):
temp_x[loc, i] = im_i[m, loc]
KL_I = vae.chaini_I(temp_x[loc, :], mask[loc, :], i)
temp_x[loc, -1] = im_target[m, loc]
KL_II = vae.chaini_II(temp_x[loc, :], mask[loc, :], i)
approx_KL += KL_I
approx_KL -= KL_II
R = approx_KL / M
return R

245
train_and_test_functions.py Normal file

@ -0,0 +1,245 @@
from p_vae import *
from codings import *
import numpy as np
import tensorflow as tf
from scipy.stats import bernoulli
import argparse
import os
from random import sample
#### parser configurations
parser = argparse.ArgumentParser(
description='EDDI')
parser.add_argument(
'--epochs',
type=int,
default=3000,
metavar='N_eps',
help='number of epochs to train (default: 3000)')
parser.add_argument(
'--latent_dim',
type=int,
default=10,
metavar='LD',
help='latent dimension (default: 10)')
parser.add_argument(
'--p',
type=float,
default=0.7,
metavar='probability',
help='dropout probability of artificial missingness during training')
parser.add_argument(
'--iteration',
type=int,
default=-1,
metavar='it',
help='iterations per epoch. set to -1 to run the full epoch. ')
parser.add_argument(
'--batch_size',
type=int,
default=100,
metavar='batch',
help='Mini Batch size per epoch. ')
parser.add_argument(
'--K',
type=int,
default=20,
metavar='K',
help='Dimension of PNP feature map ')
parser.add_argument(
'--M',
type=int,
default=50,
metavar='M',
help='Number of MC samples when perform imputing')
parser.add_argument(
'--output_dir',
type=str,
default=os.getenv('PT_OUTPUT_DIR', '/tmp'))
parser.add_argument(
'--data_dir',
type=str,
default=os.getenv('PT_DATA_DIR', 'data'),
help='Directory where UCI dataset is stored.')
args = parser.parse_args()
#### Set directories
UCI = args.data_dir
ENCODER_WEIGHTS = os.path.join(args.output_dir, 'encoder.tensorflow')
FINETUNED_DECODER_WEIGHTS = os.path.join(args.output_dir, 'generator.tensorflow')
rs = 42 # random seed
def train_p_vae(Data_train,mask_train, epochs, latent_dim,batch_size, p, K,iteration):
'''
This function trains the partial VAE.
:param Data_train: training Data matrix, N by D
:param mask_train: mask matrix that indicates the missingness. 1=observed, 0 = missing
:param epochs: number of epochs of training
:param latent_dim: latent dimension for partial VAE model
:param p: dropout rate for creating additional missingness during training
:param K: dimension of feature map of PNP encoder
:param iteration: how many mini-batches are used each epoch. set to -1 to run the full epoch.
:return: the trained VAE.
'''
obs_dim = Data_train.shape[1]
n_train = Data_train.shape[0]
list_train = np.arange(n_train)
####### construct
kwargs = {
'K': K,
'obs_distrib': "Gaussian",
'latent_dim': latent_dim,
'batch_size': batch_size,
'encoder': PNP_fc_uci_encoder,
'decoder': fc_uci_decoder,
'obs_dim': obs_dim,
'load_model':0,
'decoder_path': FINETUNED_DECODER_WEIGHTS,
'encoder_path': ENCODER_WEIGHTS,
}
vae = PN_Plus_VAE(**kwargs)
if iteration == -1:
n_it = int(np.ceil(n_train / kwargs['batch_size']))
else:
n_it = iteration
for epoch in range(epochs):
training_loss_full = 0.
# test_loss, test_kl, test_recon = vae.full_batch_loss(Data_test,mask_test)
# test_loss = test_loss
# test_kl = test_kl / n_test
# test_recon = test_recon / n_test
# iterate through batches
# np.random.shuffle(list_train)
for it in range(n_it):
if iteration == -1:
batch_indices = list_train[it * kwargs['batch_size'] : (it + 1) * kwargs['batch_size']]
else:
batch_indices = sample(range(n_train), kwargs['batch_size'])
x = Data_train[batch_indices, :]
mask_train_batch = mask_train[batch_indices, :]
DROPOUT_TRAIN = np.minimum(np.random.rand(x.shape[0], obs_dim), p)  # per-entry dropout probabilities for the actual mini-batch size
while True:
# mask_drop = np.array([bernoulli.rvs(1 - DROPOUT_TRAIN)] )
mask_drop = bernoulli.rvs(1 - DROPOUT_TRAIN)
if np.sum(mask_drop>0):
break
# mask_drop = mask_drop.reshape([kwargs['batch_size'], obs_dim])
_ = vae.update(x, mask_drop*mask_train_batch)
loss_full, _, _ = vae.full_batch_loss(x,mask_drop*mask_train_batch)
training_loss_full += loss_full
# average loss over most recent epoch
training_loss_full /= n_it
print(
'Epoch: {} \tnegative training ELBO per observed feature: {:.2f}'
.format(epoch, training_loss_full))
vae.save_generator(FINETUNED_DECODER_WEIGHTS)
vae.save_encoder(ENCODER_WEIGHTS)
return vae
def test_p_vae_marginal_elbo(Data_test,mask_test,latent_dim,K):
'''
This function computes the marginal (negative) ELBO of observed test data
Note that this function does not perform imputing.
:param Data_test: test data matrix
:param mask_test: mask matrix that indicates the missingness. 1=observed, 0 = missing
:param latent_dim: latent dimension for partial VAE model
:param K:dimension of feature map of PNP encoder
:return: test negative ELBO
'''
obs_dim = Data_test.shape[1]
####### construct
kwargs = {
'K': K,
'obs_distrib': "Gaussian",
'latent_dim': latent_dim,
'encoder': PNP_fc_uci_encoder,
'decoder': fc_uci_decoder,
'obs_dim': obs_dim,
'load_model':1,
'decoder_path': FINETUNED_DECODER_WEIGHTS,
'encoder_path': ENCODER_WEIGHTS,
}
vae = PN_Plus_VAE(**kwargs)
test_loss, _, _ = vae.full_batch_loss(Data_test,mask_test)
print('test negative ELBO per feature: {:.2f}'
.format(test_loss))
return test_loss
def impute_p_vae(Data_observed,mask_observed,Data_ground_truth,mask_target,latent_dim,batch_size,K,M):
'''
This function loads a pretrained p-VAE model, and performs imputation and returns RMSE.
:param Data_observed: observed data matrix
:param mask_observed: mask matrix that indicates the missingness of observed data. 1=observed, 0 = missing
:param Data_ground_truth: ground truth data for calculating RMSE performance. It can also contain missing data.
:param mask_target: mask matrix that indicates the missingness of ground truth data. 1=observed, 0 = missing
:param latent_dim: latent dimension for partial VAE model
:param batch_size: Mini-batch size. We evaluate test RMSE in batch mode in order to handle large dataset.
:param K: dimension of feature map of PNP encoder
:param M: number of samples used for MC sampling
:return: RMSE
'''
obs_dim = Data_observed.shape[1]
####### construct
n_data = Data_observed.shape[0]
# mask_test_obs = 1 * (Data_observed != missing_value_indicator) # mask of test observed entries
# mask_test_target = 1 * (Data_ground_truth != missing_value_indicator) # mask of test missingness to be imputed
mask_test_obs = mask_observed
mask_test_target = mask_target
kwargs = {
'K': K,
'obs_distrib': "Gaussian",
'latent_dim': latent_dim,
'batch_size': batch_size,
'encoder': PNP_fc_uci_encoder,
'decoder': fc_uci_decoder,
'obs_dim': obs_dim,
'load_model': 1,
'decoder_path': FINETUNED_DECODER_WEIGHTS,
'encoder_path': ENCODER_WEIGHTS,
'M': M,
}
vae = PN_Plus_VAE(**kwargs)
impute_loss_RMSE = 0.
list_data = np.arange(n_data)
# np.random.shuffle(list_data)
n_it = int(np.ceil(n_data / kwargs['batch_size']))
# iterate through batches
for it in range(n_it):
batch_indices = list_data[it * kwargs['batch_size'] : (it + 1) * kwargs['batch_size']]
_, impute_loss_RMSE_batch = vae.impute_losses(Data_ground_truth[batch_indices, :], mask_test_obs[batch_indices, :],
mask_test_target[batch_indices, :])
impute_loss_RMSE += impute_loss_RMSE_batch
impute_loss_RMSE /= (n_it)
print('test impute RMSE: {:.2f}'
.format(impute_loss_RMSE))
return impute_loss_RMSE