fixed random mask problem; added EDDI strategy
This commit is contained in:
LaurantChao 2019-04-03 16:59:27 +01:00 committed by GitHub
Parent 99aa1227c2
Commit c1f4cb19b8
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
8 changed files: 1659 additions and 0 deletions

81
README.txt Normal file

@ -0,0 +1,81 @@
EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE (https://arxiv.org/pdf/1809.11142.pdf)
For ease of evaluation, we provide two versions of the code.
The first version only contains the missing data imputation model using Partial VAE.
The second version contains both the training of the Partial VAE and the single optimal ordering strategy for feature selection (SING in the paper).
*****Version One: Missing Data Imputation*****
This code implements the Partial VAE (PNP) part, demonstrated on a UCI dataset.
To run this code:
python main_train_and_impute.py --epochs 3000 --latent_dim 10 --p 0.99 --data_dir your_directory/data/boston/ --output_dir your_directory/model/
Input:
In this example, we use the Boston house dataset. (https://archive.ics.uci.edu/ml/machine-learning-databases/housing/)
We split the dataset into train and test set in our code.
Random test set entries are removed for imputation quality evaluation (see the sketch at the end of this section).
Output:
This code will save the trained model to your_directory/model. To impute new data (the test dataset in this case), we load the pre-trained model.
If your training set itself contains missing data, you can use the same training set as the test set to obtain imputation results.
The code also computes the test RMSE for the imputed missing data and prints the result to the console.
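For reference, here is a minimal sketch of how random test entries are hidden before imputation is scored (mirroring main_train_and_impute.py; the 30% missing rate and the placeholder test matrix are illustrative only):
    import numpy as np
    from scipy.stats import bernoulli

    Data_test = np.random.rand(51, 14)                        # placeholder for the held-out test matrix
    mask_obs = bernoulli.rvs(1 - 0.3, size=Data_test.shape)   # 1 = kept observed, 0 = artificially removed
    Data_observed = Data_test * mask_obs                      # what the model sees at imputation time
    mask_target = 1 - mask_obs                                # entries to be imputed and scored (RMSE)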
*****Version Two: Single Optimal Feature Ordering*****
This code implements partial VAE (PNP) + SING together demonstrated on a UCI dataset.
To run this code:
python main_active_learning.py --epochs 3000 --latent_dim 10 --p 0.99 --data_dir your_directory/data/boston/ --output_dir your_directory/model/
Input:
In this example, we use the Boston house dataset. (https://archive.ics.uci.edu/ml/machine-learning-databases/housing/)
We split the dataset into train and test set in our code.
Random test set entries are removed for imputation quality evaluation.
Output:
This code will run the Partial VAE on the training set and use the trained model for sequential feature selection as presented in the paper.
It will output the result comparing RAND and SING in the form of an information curve, as presented in the paper.
It will also output the information gain for each feature selection step. These are stored in your_directory/model.
*****Possible Arguments*****
- epochs: number of epochs.
- latent_dim: size of latent space of partial VAE.
- p: upper bound for the artificial missingness probability. For example, if set to 0.9, then during each training epoch, the algorithm will
randomly choose a probability smaller than 0.9 and randomly drop observations according to this probability (a minimal sketch follows after this list).
Our suggestion is that if the original dataset already contains missing data, you can just set p to 0.
- batch_size: minibatch size for training. default: 100
- iteration: iterations (number of mini-batches) used per epoch. Set to -1 to run the full epoch.
If your dataset is large, set this to a smaller value such as 10.
- K: dimension of the feature map (h) of the PNP encoder. Default: 20
- M: number of MC samples used when performing imputation. Default: 50
- data_dir: Directory where UCI dataset is stored.
- output_dir: Directory where the trained model will be stored and loaded.
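For illustration, the following is a minimal sketch of how p is turned into an artificial missingness mask during training (it mirrors the mask construction in train_and_test_functions.py; the batch_size and obs_dim values are placeholders):
    import numpy as np
    from scipy.stats import bernoulli

    p = 0.7                          # upper bound for the artificial missingness probability
    batch_size, obs_dim = 100, 14    # placeholder shapes
    # every entry of the mini-batch gets its own dropout probability, capped at p
    dropout = np.minimum(np.random.rand(batch_size, obs_dim), p)
    mask_drop = bernoulli.rvs(1 - dropout)   # 1 = kept, 0 = artificially dropped
    # the mask actually used for training also respects missingness already present in the data:
    # mask = mask_drop * mask_train_batch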
Other comments:
- We assume that the data is stored in an Excel file named d0.xls,
and that the last column is the target variable of interest (only used in active learning).
You should modify the load-data section according to your data (a loading sketch is given at the end of these comments).
- Note that this code assumes real-valued data with Gaussian noise. You may need to modify the likelihood function for other types of data.
- In preprocessing, we chose to squash the data to the range of 0 and 1. Therefore our decoder output has also been squashed
by a sigmoid function. If you wish to change the preprocessing, you may also need to change the decoder accordingly.
This can be found in codings.py.
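As a reference, here is a minimal sketch of the expected data format and preprocessing (following the load-data section of main_train_and_impute.py; the path is a placeholder, and .values replaces the deprecated .as_matrix()):
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    Data = pd.read_excel('your_directory/data/boston/d0.xls').values   # last column = target variable
    # squash every column to the range [0, 1]
    Data = (Data - Data.min(axis=0)) / (Data.max(axis=0) - Data.min(axis=0))
    Mask = np.ones(Data.shape)   # 1 = observed, 0 = missing; this UCI dataset is fully observed
    Data_train, Data_test, mask_train, mask_test = train_test_split(
        Data, Mask, test_size=0.1, random_state=42)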
File Structure:
- main functions:
main_train_and_impute.py: implements the training of the Partial VAE (PNP) part, demonstrated on a UCI dataset.
main_active_learning.py: implements the EDDI strategy, together with a global single ordering strategy, based on the Partial VAE, demonstrated on a UCI dataset;
it will also generate an information curve plot.
- decoder-encoder functions: codings.py
- partial VAE class: p_vae.py
- training-impute functions: train_and_test_functions.py
- training-active learning functions: active_learning_functions.py
- active learning visualization: boston_bar_plot.py; this will visualize the decision process of EDDI on the Boston Housing data.
- data: data/boston/d0.xls
Dependencies:
- tensorflow 1.4.0
- scipy 1.10
- sklearn 0.19.1
- argparse 1.1


@ -0,0 +1,405 @@
from p_vae import *
from codings import *
import numpy as np
import tensorflow as tf
from scipy.stats import bernoulli
import argparse
import os
import random
from random import sample
#### parser configurations
parser = argparse.ArgumentParser(
description='EDDI')
parser.add_argument(
'--epochs',
type=int,
default=3000,
metavar='N_eps',
help='number of epochs to train (default: 3000)')
parser.add_argument(
'--latent_dim',
type=int,
default=10,
metavar='LD',
help='latent dimension (default: 10)')
parser.add_argument(
'--p',
type=float,
default=0.7,
metavar='probability',
help='dropout probability of artificial missingness during training')
parser.add_argument(
'--iteration',
type=int,
default=-1,
metavar='it',
help='iterations per epoch. set to -1 to run the full epoch. ')
parser.add_argument(
'--batch_size',
type=int,
default=100,
metavar='batch',
help='Mini Batch size per epoch. ')
parser.add_argument(
'--K',
type=int,
default=20,
metavar='K',
help='Dimension of PNP feature map ')
parser.add_argument(
'--M',
type=int,
default=50,
metavar='M',
help='Number of MC samples when perform imputing')
parser.add_argument(
'--repeat',
type=int,
default=10,
metavar='repeat',
help='Number of repeats of the active learning experiment')
parser.add_argument(
'--output_dir',
type=str,
default=os.getenv('PT_OUTPUT_DIR', '/tmp'))
parser.add_argument(
'--data_dir',
type=str,
default=os.getenv('PT_DATA_DIR', 'data'),
help='Directory where UCI dataset is stored.')
args = parser.parse_args()
#### Set directories
UCI = args.data_dir
ENCODER_WEIGHTS = os.path.join(args.output_dir, 'encoder.tensorflow')
FINETUNED_DECODER_WEIGHTS = os.path.join(args.output_dir, 'generator.tensorflow')
rs = 42 # random seed
def p_vae_active_learning(Data_train,mask_train,Data_test,mask_test,epochs,latent_dim,batch_size,p,K,M,Repeat,estimation_method=0):
'''
This function trains a Partial VAE and then performs active variable selection with the random (RAND), single global ordering (SING), and EDDI chain-rule (CHAI) strategies.
Note that we assume that the last column of x is the target variable of interest
:param Data_train: training data matrix
:param mask_train: mask matrix that indicates the missingness of training data. 1=observed, 0 = missing
:param Data_test: test data matrix
:param mask_test: mask matrix that indicates the missingness of test data. 1=observed, 0 = missing
:param latent_dim: latent dimension of partial VAE.
:param K: dimension of feature map of PNP encoder
:param M: number of samples used for MC sampling
:param Repeat: number of repeats.
:param estimation_method: what method to use for single ordering information reward estimation.
In order to calculate the single best ordering, we need to marginalize (average) the
information reward over the dataset (in this case, the test set).
We provide two methods of marginalization.
- estimation_method = 0: information reward marginalized using the model distribution p_{vae_model}(x_o).
- estimation_method = 1: information reward marginalized using the data distribution p_{data}(x_o)
:return: None (active learning results are saved to args.output_dir)
'''
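# The three strategies evaluated below are:
#   strategy 0 = RAND (random feature ordering)
#   strategy 1 = SING (a single global ordering shared by all test points)
#   strategy 2 = CHAI (the EDDI per-test-point chain rule approximation)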
for r in range(Repeat):
## train partial VAE
tf.reset_default_graph()
vae = train_p_vae(Data_train,mask_train, epochs, latent_dim,batch_size, p, K,10)
n_test = Data_test.shape[0]
n_train = Data_train.shape[0]
OBS_DIM = Data_test.shape[1]
# kwargs = {
# 'K': K,
# 'obs_distrib': "Gaussian",
# 'latent_dim': latent_dim,
# 'encoder': PNP_fc_uci_encoder,
# 'decoder': fc_uci_decoder,
# 'obs_dim': OBS_DIM,
# 'load_model': 1,
# 'decoder_path': FINETUNED_DECODER_WEIGHTS,
# 'encoder_path': ENCODER_WEIGHTS,
# }
# vae = PN_Plus_VAE(**kwargs)
## create arrays to store results
if r == 0:
# information curves
information_curve_RAND = np.zeros(
(Repeat, n_test, OBS_DIM - 1 + 1))
information_curve_SING = np.zeros(
(Repeat, n_test, OBS_DIM - 1 + 1))
information_curve_CHAI = np.zeros(
(Repeat, n_test, OBS_DIM - 1 + 1))
# history of optimal actions
action_SING = np.zeros((Repeat, n_test,
OBS_DIM - 1 ))
action_CHAI = np.zeros((Repeat, n_test,
OBS_DIM - 1))
# history of information reward values
R_hist_SING = np.zeros(
(Repeat, OBS_DIM - 1 , n_test,
OBS_DIM - 1 ))
R_hist_CHAI = np.zeros(
(Repeat, OBS_DIM - 1, n_test,
OBS_DIM - 1))
# history of posterior samples of partial inference
im_SING = np.zeros((Repeat, OBS_DIM - 1 , M,
n_test, OBS_DIM ))
im_CHAI = np.zeros((Repeat, OBS_DIM - 1, M,
n_test, OBS_DIM))
## Perform active variable selection with RAND (random ordering), SING (single global ordering), and EDDI (chain rule approximation)
for strategy in range(3):
if strategy == 0:### random strategy
## create arrays to store data and missingness
x = Data_test[:, :] #
x = np.reshape(x, [n_test, OBS_DIM])
mask = np.zeros((n_test, OBS_DIM))
mask[:, -1] = 0 # we will never observe target value
## initialize array that stores optimal actions (i_optimal)
i_optimal = [
nums for nums in range(OBS_DIM - 1 )
]
i_optimal = np.tile(i_optimal, [n_test, 1])
for c in i_optimal: random.shuffle(c)  # shuffle each test point's feature ordering in place
## evaluate likelihood at initial stage (no observation)
negative_predictive_llh, uncertainty = vae.predictive_loss(
x, mask, M)
information_curve_RAND[r, :, 0] = negative_predictive_llh
for t in range(OBS_DIM - 1 ):
print("Repeat = {:.1f}".format(r))
print("Strategy = {:.1f}".format(strategy))
print("Step = {:.1f}".format(t))
io = np.eye(OBS_DIM)[i_optimal[:, t]]
mask = mask + io
negative_predictive_llh, uncertainty = vae.predictive_loss(
x, mask, M)
information_curve_RAND[r, :, t +
1] = negative_predictive_llh
elif strategy == 1:### single ordering strategy "SING"
#SING is obtained by maximizing the mean information reward per step over the test set, to be consistent with the description in the paper.
#We could also obtain this ordering using a subset of the training set and then apply it to the test set.
x = Data_test[:, :] #
x = np.reshape(x, [n_test, OBS_DIM])
mask = np.zeros((n_test, OBS_DIM)) # this stores the mask of missingness (stemming from both test data missingness and unselected features during active learning)
mask2 = np.zeros((n_test, OBS_DIM)) # this stores the mask indicating which features have been selected for each data point
mask[:, -1] = 0 # Note that no matter how you initialize mask, we always keep the target variable (last column) unobserved.
negative_predictive_llh, uncertainty = vae.predictive_loss(
x, mask, M)
information_curve_SING[r, :, 0] = negative_predictive_llh
for t in range(OBS_DIM - 1 ): # t indexes the selection step
print("Repeat = {:.1f}".format(r))
print("Strategy = {:.1f}".format(strategy))
print("Step = {:.1f}".format(t))
## note that for single ordering, there are two rewards.
# The first one (R) is calculated based on no observations.
# This is used for active learning phase, since single ordering should not depend on observations.
# The second one (R_eval) is calculated in the same way as chain rule approximation. This is only used for visualization.
R = -1e4 * np.ones(
(n_test, OBS_DIM - 1)
)
im_0 = completion(x, mask*0, M, vae) # sample from model prior
im = completion(x, mask, M, vae) # sample conditional on observations
im_SING[r, t, :, :, :] = im
for u in range(OBS_DIM - 1): # u indexes the candidate features; compute the reward for each candidate
loc = np.where(mask2[:, u] == 0)[0]
if estimation_method == 0:
R[loc, u] = R_lindley_chain(u, x, mask, M, vae, im_0,loc)
else:
R[loc, u] = R_lindley_chain(u, x, mask, M, vae, im, loc)
R_hist_SING[r, t, :, :] = R
i_optimal = (R.mean(axis=0)).argmax() # optimal decision based on reward averaged on all data
i_optimal = np.tile(i_optimal, [n_test])
io = np.eye(OBS_DIM)[i_optimal]
action_SING[r, :, t] = i_optimal
mask = mask + io*mask_test # this mask takes into account both data missingness and missingness of unselected features
negative_predictive_llh, uncertainty = vae.predictive_loss(
x, mask, M)
mask2 = mask2 + io # this mask only stores the missingness of unselected features, i.e., which features have been selected for each data point
information_curve_SING[r, :, t +
1] = negative_predictive_llh
elif strategy == 2: ### EDDI strategy (chain rule approximation)
# personalized active feature selection strategy
## create arrays to store data and missingness
x = Data_test[:, :] #
x = np.reshape(x, [n_test, OBS_DIM])
mask = np.zeros((n_test, OBS_DIM)) # this stores the mask of missingness (stemming from both test data missingness and unselected features during active learning)
mask2 = np.zeros((n_test,OBS_DIM)) # this stores the mask indicating which features have been selected for each data point
mask[:,-1] = 0 # Note that no matter how you initialize mask, we always keep the target variable (last column) unobserved.
## evaluate likelihood at initial stage (no observation)
negative_predictive_llh, uncertainty = vae.predictive_loss(
x, mask, M)
information_curve_CHAI[r, :, 0] = negative_predictive_llh
for t in range(OBS_DIM - 1): # t indexes the selection step
print("Repeat = {:.1f}".format(r))
print("Strategy = {:.1f}".format(strategy))
print("Step = {:.1f}".format(t))
R = -1e4 * np.ones((n_test, OBS_DIM - 1))
im = completion(x, mask, M, vae)
im_CHAI[r, t, :, :, :] = im
for u in range(OBS_DIM - 1): # u indexes the candidate features; compute the reward for each candidate
loc = np.where(mask[:, u] == 0)[0]
R[loc, u] = R_lindley_chain(u, x, mask, M, vae, im,
loc)
R_hist_CHAI[r, t, :, :] = R
i_optimal = R.argmax(axis=1)
io = np.eye(OBS_DIM)[i_optimal]
action_CHAI[r, :, t] = i_optimal
mask = mask + io # this mask takes into account both data missingness and missingness of unselected features
negative_predictive_llh, uncertainty = vae.predictive_loss(
x, mask, M)
mask2 = mask2 + io # this mask only stores the missingness of unselected features, i.e., which features have been selected for each data point
print(mask2[0:5, :])
information_curve_CHAI[r, :, t +
1] = negative_predictive_llh
# Save results
np.savez(
os.path.join(
args.output_dir,
'UCI_information_curve_RAND.npz'),
information_curve=information_curve_RAND)
np.savez(
os.path.join(
args.output_dir,
'UCI_information_curve_SING.npz'),
information_curve=information_curve_SING)
np.savez(
os.path.join(
args.output_dir,
'UCI_information_curve_CHAI.npz'),
information_curve=information_curve_CHAI)
np.savez(
os.path.join(args.output_dir,
'UCI_action_SING.npz'),
action=action_SING)
np.savez(
os.path.join(args.output_dir,
'UCI_action_CHAI.npz'),
action=action_CHAI)
np.savez(
os.path.join(args.output_dir,
'UCI_R_hist_SING.npz'),
R_hist=R_hist_SING)
np.savez(
os.path.join(args.output_dir,
'UCI_R_hist_CHAI.npz'),
R_hist=R_hist_CHAI)
np.savez(
os.path.join(args.output_dir,
'UCI_im_SING.npz'),
im=im_SING)
np.savez(
os.path.join(args.output_dir,
'UCI_im_CHAI.npz'),
im=im_CHAI)
return None
def train_p_vae(Data_train,mask_train, epochs, latent_dim,batch_size, p, K,iteration):
'''
This function trains the partial VAE.
:param Data_train: training Data matrix, N by D
:param mask_train: mask matrix that indicates the missingness. 1=observed, 0 = missing
:param epochs: number of epochs of training
:param latent_dim: latent dimension for partial VAE model
:param p: dropout rate for creating additional missingness during training
:param K: dimension of feature map of PNP encoder
:param iteration: how many mini-batches are used each epoch. set to -1 to run the full epoch.
:return: the trained VAE.
'''
obs_dim = Data_train.shape[1]
n_train = Data_train.shape[0]
list_train = np.arange(n_train)
####### construct
kwargs = {
'K': K,
'obs_distrib': "Gaussian",
'latent_dim': latent_dim,
'batch_size': batch_size,
'encoder': PNP_fc_uci_encoder,
'decoder': fc_uci_decoder,
'obs_dim': obs_dim,
'load_model':0,
'decoder_path': FINETUNED_DECODER_WEIGHTS,
'encoder_path': ENCODER_WEIGHTS,
}
vae = PN_Plus_VAE(**kwargs)
if iteration == -1:
n_it = int(np.ceil(n_train / kwargs['batch_size']))
else:
n_it = iteration
for epoch in range(epochs):
training_loss_full = 0.
# test_loss, test_kl, test_recon = vae.full_batch_loss(Data_test,mask_test)
# test_loss = test_loss
# test_kl = test_kl / n_test
# test_recon = test_recon / n_test
# iterate through batches
# np.random.shuffle(list_train)
for it in range(n_it):
if iteration == -1:
batch_indices = list_train[it * kwargs['batch_size'] : (it + 1) * kwargs['batch_size']]
else:
batch_indices = sample(range(n_train), kwargs['batch_size'])
x = Data_train[batch_indices, :]
mask_train_batch = mask_train[batch_indices, :]
# DROPOUT_TRAIN = np.minimum(np.random.rand(1), p)
# while True:
# mask_drop = np.array([bernoulli.rvs(1 - DROPOUT_TRAIN, size=obs_dim)] *
# kwargs['batch_size'])
# if np.sum(mask_drop>0):
# break
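# Fix for the random-mask problem: the commented-out block above drew a single dropout probability and
# repeated the same drop pattern for every row of the mini-batch; below, every entry gets its own
# dropout probability (capped at p) and an independent Bernoulli draw.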
DROPOUT_TRAIN = np.minimum(np.random.rand(x.shape[0], obs_dim), p)  # per-entry dropout probabilities for the actual mini-batch size
while True:
# mask_drop = np.array([bernoulli.rvs(1 - DROPOUT_TRAIN)] )
mask_drop = bernoulli.rvs(1 - DROPOUT_TRAIN)
if np.sum(mask_drop > 0):
break
mask_drop = mask_drop.reshape([x.shape[0], obs_dim])
_ = vae.update(x, mask_drop*mask_train_batch)
loss_full, _, _ = vae.full_batch_loss(x,mask_drop*mask_train_batch)
training_loss_full += loss_full
# average loss over most recent epoch
training_loss_full /= n_it
print(
'Epoch: {} \tnegative training ELBO per observed feature: {:.2f}'
.format(epoch, training_loss_full))
vae.save_generator(FINETUNED_DECODER_WEIGHTS)
vae.save_encoder(ENCODER_WEIGHTS)
return vae

172
boston_bar_plot.py Normal file

@ -0,0 +1,172 @@
'''
- This file demonstrates the visualization of active learning process of Boston Housing.
Please first run the active learning in main_active_learning.py
Then, to run this file:
python boston_bar_plot.py --data_dir your_directory/data/boston/ --output_dir your_directory/model/
- Each figure named in the format "_SING_avg_bars_x.png" (where x is the step of active learning) shows the distribution
of information reward function of each feature at this specific step on the y-axis.
All unobserved variables start with green bars, and turns purple once selected by the algorithm. . Step 0 corresponds to no features has been selected.
All reward will be updated after each decision of feature selection, and the reward of selected features will
will be fixed.
- Each figure named in the format "_SING_avg_violins_x.png" (where x is the step of active learning) shows the
violin plot of the posterior density estimates of the remaining unobserved variables at that specific step
'''
import pandas as pd
import numpy as np
import argparse
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import sklearn.preprocessing as preprocessing
parser = argparse.ArgumentParser(
description='EDDI')
parser.add_argument(
'--output_dir',
type=str,
default=os.getenv('PT_OUTPUT_DIR', '/tmp'))
parser.add_argument(
'--data_dir',
type=str,
default=os.getenv('PT_DATA_DIR', 'data'),
help='Directory where UCI dataset is stored.')
args = parser.parse_args()
#### load data
data_grid = np.array([0,1,2,3,4,5,6,9,10,12])
data = 1
rs = 42
UCI = args.data_dir
Data = pd.read_excel(UCI + 'd0.xls')
Data = Data.as_matrix()
OBS_DIM = Data.shape[1]
### UCI data preprocess
if data_grid[data-1] == 12:
size_feature = Data.shape[1]
n_data = Data.shape[0]
else:
max_Data = 1 #
min_Data = 0 #
Data_std = (Data - Data.min(axis=0)) / (Data.max(axis=0) - Data.min(axis=0))
Data = Data_std * (max_Data - min_Data) + min_Data
size_feature = Data.shape[1]
n_data = Data.shape[0]
Data_train, Data_test = train_test_split(
Data, test_size=0.1, random_state=rs)
UCI_SING = args.output_dir
npzfile = np.load(UCI_SING+'UCI_R_hist_SING.npz')
R_SING=npzfile['R_hist']
R_SING = np.maximum(R_SING,0)
npzfile = np.load(UCI_SING+'UCI_action_SING.npz')
A_SING=npzfile['action']
npzfile = np.load(UCI_SING+'UCI_im_SING.npz')
im_SING=npzfile['im']
K = 3
T=10
repeat = 0
### SING strategy over all data
pos = np.arange(R_SING.shape[1])+1
A_SING_0_t = []
A_SING_t_minus_1 = []
R_SING_t_minus_1 = []
bars1 = []
r1 = []
r2 = pos
for t in range(T):
# Create bars T=1
A_SING_t = A_SING[repeat,K,t]
R_SING_t = R_SING[repeat,t,:,:].mean(axis=0)*100
barWidth = 0.9
bars2 = R_SING_t
if t >0:
bars1 = np.append(bars1, R_SING_t_minus_1[int(A_SING_t_minus_1)])
bars2 = np.delete(bars2,A_SING_0_t.astype(int))
# The X position of bars
if t>0:
r1 = np.append(r1,int(A_SING_t_minus_1)+1)
r2 = np.delete(r2,np.argwhere(r2==int(A_SING_t_minus_1)+1))
# Create barplot
if t>0:
plt.bar(r1, bars1, width = barWidth, color = (0.3,0.1,0.4,0.6), label='Selected')
plt.bar(r2, bars2, width = barWidth, color = (0.3,0.9,0.4,0.6), label='Unselected')
# Note: the barplot could be created easily. See the barplot section for other examples.
# Create legend
plt.legend()
# Text below each barplot with a rotation at 90°
#plt.xticks([r + barWidth for r in range(R_SING.shape[1])], ['crime rate','proportion of residential land zoned','proportion of non-retail business acres',' Charles River',' nitric oxides concentration','average number of rooms','owner-occupied units built prior to 1940','distances to Boston centres','accessibility to radial highways','property-tax rate',' pupil-teacher ratio','proportion of blacks','lower status of the population'], rotation=90)
plt.xticks([r + barWidth for r in range(R_SING.shape[1])], ['CR','PRD','PNB',' CHR',' NOC','ANR','OUB','DTB','ARH','TAX',' OTR','PB','LSP',])
# Create labels
# Adjust the margins
#plt.subplots_adjust(bottom= 0.3, top = 0.98)
# Show graphic
plt.savefig(UCI_SING+'_SING_avg_bars_'+str(t)+'.png', format='png', dpi=200)
plt.show()
A_SING_t_minus_1 = A_SING_t
R_SING_t_minus_1 = R_SING_t
A_SING_0_t = np.append(A_SING_0_t,A_SING_t)
## plot violin plot
im = im_SING[repeat,t,:,K,:]
target = Data_test[K,:]
if t >0:
im[:,(r1-1).astype(int)] = target[(r1-1).astype(int)]
M = im.shape[0]
obs_dim = im.shape[1]
GR_name = np.array(['CR','PRD','PNB',' CHR',' NOC','ANR','OUB','DTB','ARH','TAX',' OTR','PB','LSP','PRC'])
GR_label_im = np.array([])
GR_label_target = np.array([])
IM = np.reshape(im.T,(M*obs_dim,))
for i in range(M*obs_dim):
GR_label_im = np.append(GR_label_im, GR_name[int(np.floor(i/M))])
df_im = pd.DataFrame(dict(Score = IM, Group = GR_label_im))
for i in range(obs_dim):
GR_label_target = np.append(GR_label_target, GR_name[int(i)])
df_target = pd.DataFrame(dict(Score = target, Group = GR_label_target))
# library & dataset
import seaborn as sns
# Use a color palette
plt.legend()
ax1 = sns.violinplot( x=df_im["Group"], y=df_im["Score"],palette="Blues",scale="count",bw=0.5)
ax1.set_ylabel('')
ax2 = sns.pointplot(x=df_target["Group"], y=df_target["Score"], join = False,markers = 'x',scatter_kws={"s": 0.1},color = 'black')
ax2.set_ylabel('')
plt.savefig(UCI_SING+'SING_avg_violins_'+str(t)+'.png', format='png', dpi=200)
plt.show()

28
codings.py Normal file

@ -0,0 +1,28 @@
import tensorflow as tf
from tensorflow.contrib import layers
from tensorflow.contrib.framework import arg_scope
def fc_uci_decoder(z, obs_dim, activation=tf.nn.sigmoid): # only output means, since the model is N(m, sigma*I) or Bernoulli(m)
x = layers.fully_connected(z, 50, scope='fc-01')
x = layers.fully_connected(x, 100, scope='fc-02')
x = layers.fully_connected(x, obs_dim, activation_fn=tf.nn.sigmoid,
scope='fc-final')
return x, None
def fc_uci_encoder(x, latent_dim, activation=None):
e = layers.fully_connected(x, 100, scope='fc-01')
e = layers.fully_connected(e, 50, scope='fc-02')
e = layers.fully_connected(e, 2 * latent_dim, activation_fn=activation,
scope='fc-final')
return e
def PNP_fc_uci_encoder(x, K, activation=None):
e = layers.fully_connected(x, 100, scope='fc-01')
e = layers.fully_connected(e, 50, scope='fc-02')
e = layers.fully_connected(e, K, scope='fc-final')
return e

113
main_active_learning.py Normal file

@ -0,0 +1,113 @@
'''
EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE
This code implements the EDDI active learning strategy, together with a global single ordering strategy
based on partial VAE (PNP), demonstrated on a UCI dataset.
To run this code:
python main_active_learning.py --epochs 3000 --latent_dim 10 --p 0.99 --data_dir your_directory/data/boston --output_dir your_directory/model
possible arguments:
- epochs: number of epochs.
- latent_dim: size of latent space of partial VAE.
- p: upper bound for the artificial missingness probability. For example, if set to 0.9, then during each training epoch, the algorithm will
randomly choose a probability smaller than 0.9 and randomly drop observations according to this probability.
Our suggestion is that if the original dataset already contains missing data, you can just set p to 0.
- batch_size: mini-batch size for training. default: 100
- iteration: iterations (number of mini-batches) used per epoch. Set to -1 to run the full epoch.
If your dataset is large, set this to a smaller value such as 10.
- K: dimension of the feature map (h) of the PNP encoder. Default: 20
- M: number of MC samples used when performing imputation. Default: 50
- repeat: Number of repeats of the active learning experiment
- data_dir: Directory where UCI dataset is stored.
- output_dir: Directory where the trained model will be stored and loaded.
Other comments:
- We assume that the data is stored in an Excel file named d0.xls,
and that the last column is the target variable of interest (only used in active learning).
You may need to modify the load-data section according to your task.
- Note that this code assumes real-valued data with Gaussian noise. You may need to modify the likelihood function for other types of data.
- In preprocessing, we chose to squash the data to the range of 0 and 1. Therefore our decoder output has also been squashed
by a sigmoid function. If you wish to change the preprocessing, you may also need to change the decoder accordingly.
This can be found in codings.py.
File Structure:
- main functions:
main_train_and_impute.py: implements the training of the Partial VAE (PNP) part, demonstrated on a UCI dataset.
main_active_learning.py: implements the EDDI strategy, together with a global single ordering strategy based on the Partial VAE, demonstrated on a UCI dataset;
it will also generate an information curve plot.
- decoder-encoder functions: codings.py
- partial VAE class:p_vae.py
- training-impute functions: train_and_test_functions.py
- training-active learning functions:active_learning_functions.py
- active learning visualization: boston_bar_plot.py; this will visualize the decision process of EDDI on the Boston Housing data.
- data: data/boston/d0.xls
'''
### load models and functions
from active_learning_functions import *
#### Import libraries
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import pandas as pd
import sklearn.preprocessing as preprocessing
plt.switch_backend('agg')
tfd = tf.contrib.distributions
### load data
Data = pd.read_excel(UCI + '/d0.xls')
Data = Data.as_matrix()
### data preprocess
max_Data = 1 #
min_Data = 0 #
Data_std = (Data - Data.min(axis=0)) / (Data.max(axis=0) - Data.min(axis=0))
Data = Data_std * (max_Data - min_Data) + min_Data
Mask = np.ones(Data.shape) # this UCI data is fully observed
Data_train, Data_test, mask_train, mask_test = train_test_split(
Data, Mask, test_size=0.1, random_state=rs)
### run training and active learning
# #number of experiments that you want to repeat
Repeat = args.repeat
#Training Partial VAE and then apply Random order feature selection (RAND in the paper) and single order feature selection (SING in the paper)
#generate information curve and per step information gain with RAND and SING.
p_vae_active_learning(Data_train,mask_train,Data_test,mask_test,args.epochs,args.latent_dim,args.batch_size,args.p,args.K,args.M,Repeat)
### visualize active learning
npzfile = np.load(args.output_dir+'/UCI_information_curve_RAND.npz')
IC_RAND=npzfile['information_curve']
npzfile = np.load(args.output_dir+'/UCI_information_curve_SING.npz')
IC_SING=npzfile['information_curve']
npzfile = np.load(args.output_dir+'/UCI_information_curve_CHAI.npz')
IC_CHAI=npzfile['information_curve']
import matplotlib.pyplot as plt
plt.figure(0)
L = IC_SING.shape[1]
fig, ax1 = plt.subplots()
# These are in unitless percentages of the figure size. (0,0 is bottom left)
left, bottom, width, height = [0.45, 0.4, 0.45, 0.45]
ax1.plot((IC_RAND[:,:,0:].mean(axis=1)).mean(axis = 0),'gs',linestyle = '-.', label = 'PNP+RAND')
ax1.errorbar(np.arange(IC_RAND.shape[2]), (IC_RAND[:,:,0:].mean(axis=1)).mean(axis = 0), yerr=(IC_RAND[:,:,0:].mean(axis=1)).std(axis = 0)/np.sqrt(IC_SING.shape[0]),ecolor='g',fmt = 'gs')
ax1.plot((IC_SING[:,:,0:].mean(axis=1)).mean(axis = 0),'ms',linestyle = '-.', label = 'PNP+SING')
ax1.errorbar(np.arange(IC_SING.shape[2]), (IC_SING[:,:L,0:].mean(axis=1)).mean(axis = 0), yerr=(IC_SING[:,0:L,0:].mean(axis=1)).std(axis = 0)/np.sqrt(IC_SING.shape[0]),ecolor='m',fmt = 'ms')
ax1.plot((IC_CHAI[:,:,0:].mean(axis=1)).mean(axis = 0),'ks',linestyle = '-.', label = 'PNP+PERSONAL')
ax1.errorbar(np.arange(IC_CHAI.shape[2]), (IC_CHAI[:,:L,0:].mean(axis=1)).mean(axis = 0), yerr=(IC_CHAI[:,0:L,0:].mean(axis=1)).std(axis = 0)/np.sqrt(IC_CHAI.shape[0]),ecolor='k',fmt = 'ks')
plt.xlabel('Steps',fontsize=18)
plt.ylabel('avg. neg. test likelihood',fontsize=18)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
ax1.legend(bbox_to_anchor=(0.0, 1.02, 1., .102), mode = "expand", loc=3,
ncol=1, borderaxespad=0.,prop={'size': 20}, frameon=False)
plt.show()
plt.savefig(args.output_dir+'/PNP_all_IC_curves.eps', format='eps', dpi=200,bbox_inches='tight')

96
main_train_and_impute.py Normal file

@ -0,0 +1,96 @@
'''
EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE
This code implements the Partial VAE (PNP) part, demonstrated on a UCI dataset.
To run this code:
python main_train_and_impute.py --epochs 3000 --latent_dim 10 --p 0.99 --data_dir your_directory/data/boston --output_dir your_directory/model
possible arguments:
- epochs: number of epochs.
- latent_dim: size of latent space of partial VAE.
- p: upper bound for the artificial missingness probability. For example, if set to 0.9, then during each training epoch, the algorithm will
randomly choose a probability smaller than 0.9 and randomly drop observations according to this probability.
Our suggestion is that if the original dataset already contains missing data, you can just set p to 0.
- batch_size: mini-batch size for training. default: 100
- iteration: iterations (number of mini-batches) used per epoch. Set to -1 to run the full epoch.
If your dataset is large, set this to a smaller value such as 10.
- K: dimension of the feature map (h) of the PNP encoder. Default: 20
- M: number of MC samples used when performing imputation. Default: 50
- data_dir: Directory where UCI dataset is stored.
- output_dir: Directory where the trained model will be stored and loaded.
Other comments:
- We assume that the data is stored in an Excel file named d0.xls,
and that the last column is the target variable of interest (only used in active learning).
You should modify the load-data section according to your data.
- Note that this code assumes real-valued data with Gaussian noise. You may need to modify the likelihood function for other types of data.
- In preprocessing, we chose to squash the data to the range of 0 and 1. Therefore our decoder output has also been squashed
by a sigmoid function. If you wish to change the preprocessing, you may also need to change the decoder accordingly.
This can be found in codings.py.
File Structure:
- main functions:
main_train_and_impute.py: implements the training of the Partial VAE (PNP) part, demonstrated on a UCI dataset.
main_active_learning.py: implements the EDDI active learning strategy, together with a global single ordering strategy based on the Partial VAE, demonstrated on a UCI dataset;
it will also generate an information curve plot.
- decoder-encoder functions: codings.py
- partial VAE class:p_vae.py
- training-impute functions: train_and_test_functions.py
- training-active learning functions:active_learning_functions.py
- active learning visualization: boston_bar_plot.py; this will visualize the decision process of EDDI on the Boston Housing data.
- data: data/boston/d0.xls
'''
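# Note on other likelihoods (an illustrative pointer, not exercised by this script): the PN_Plus_VAE class in
# p_vae.py also supports obs_distrib="Bernoulli" for binary data; to use it, change the
# 'obs_distrib': "Gaussian" entry of the kwargs dict built inside train_p_vae
# (train_and_test_functions.py) to "Bernoulli".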
### load models and functions
from train_and_test_functions import *
#### Import libraries
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import random
from random import sample
import pandas as pd
import sklearn.preprocessing as preprocessing
plt.switch_backend('agg')
tfd = tf.contrib.distributions
### load data
Data = pd.read_excel(UCI + '/d0.xls')
Data = Data.as_matrix()
### data preprocess
max_Data = 1 #
min_Data = 0 #
Data_std = (Data - Data.min(axis=0)) / (Data.max(axis=0) - Data.min(axis=0))
Data = Data_std * (max_Data - min_Data) + min_Data
Mask = np.ones(Data.shape) # This is a mask indicating missingness, 1 = observed, 0 = missing.
# this UCI data is fully observed. you should modify the set up of Mask if your data contains missing data.
### split the data into train and test sets
Data_train, Data_test, mask_train, mask_test = train_test_split(
Data, Mask, test_size=0.1, random_state=rs)
### Train the model and save the trained model.
vae = train_p_vae(Data_train,mask_train, args.epochs, args.latent_dim, args.batch_size,args.p, args.K,args.iteration)
### Test the imputation model on the test set
## Calculate test ELBO of observed test data (will load the pre-trained model). Note that this is NOT imputing.
tf.reset_default_graph()
test_loss = test_p_vae_marginal_elbo(Data_test,mask_test, args.latent_dim,args.K)
## Calculate imputation RMSE (conditioned on observed data. will load the pre-trained model)
## Note that here we perform imputation on a new dataset, whose observed entries were not used in training.
## This will underestimate the imputation performance, since in principle all observed entries should be used to train the model.
tf.reset_default_graph()
Data_ground_truth = Data_test
mask_obs = np.array([bernoulli.rvs(1 - 0.3, size=Data_ground_truth.shape[1]*Data_ground_truth.shape[0])]) # manually create missing data on test set
mask_obs = mask_obs.reshape(Data_ground_truth.shape)
Data_observed = Data_ground_truth*mask_obs
mask_target = 1-mask_obs
# The line below is optional. Uncommenting it means that we use the newly arriving test set to continue updating the imputation model; leaving it commented means we only use the pre-trained model to impute, without further updating the model.
# vae = train_p_vae(Data_ground_truth,mask_obs, args.epochs, args.latent_dim, args.batch_size,0, args.K,args.iteration)
tf.reset_default_graph()
RMSE = impute_p_vae(Data_observed,mask_obs,Data_ground_truth,mask_target,args.latent_dim,args.batch_size,args.K,args.M)

519
p_vae.py Normal file

@ -0,0 +1,519 @@
import tensorflow as tf
import numpy as np
from tensorflow.contrib import layers
from tensorflow.contrib.distributions import Normal
import re
import copy
class PN_Plus_VAE(object):
def __init__(self,
encoder,
decoder,
obs_dim,
decoder_path,
encoder_path,
learning_rate=1e-3,
optimizer=tf.train.AdamOptimizer,
obs_distrib="Gaussian",
obs_std=0.1*np.sqrt(2),
K = 20,
latent_dim = 10,
batch_size = 100,
load_model=0,
M=5,
all=1):
'''
:param encoder: type of encoder model chosen from codings.py
:param decoder: type of decoder model chosen from codings.py
:param obs_dim: maximum number of partial observational dimensions
:param encoder_path: path for saving encoder model parameter
:param decoder_path: path for saving decoder model parameter
:param learning_rate: optimizer learning rate
:param optimizer: we use Adam here.
:param obs_distrib: Bernoulli or Gaussian.
:param obs_std: observational noise for decoder.
:param K: length of code for summarizing partial observations
:param latent_dim: latent dimension of VAE
:param batch_size: training batch size
:param load_model: 1 = load a pre-trained model from decoder_path and encoder_path
:param M : number of MC samples used when performing imputing/prediction
'''
self._K = K
self._latent_dim = latent_dim
self._batch_size = batch_size
self._encode = encoder
self._decode = decoder
self._obs_dim = obs_dim
self._learning_rate = learning_rate
self._optimizer = optimizer
self._obs_distrib = obs_distrib
self._obs_std = obs_std
self._load_model = load_model
self._all = all
self._decoder_path = decoder_path
self._encoder_path = encoder_path
self._M = M
self._build_graph()
## build partial VAE
def _build_graph(self):
with tf.variable_scope('is'):
# placeholder for UCI inputs
self.x = tf.placeholder(tf.float32, shape=[None, self._obs_dim])
self.x_flat = tf.reshape(self.x, [-1, 1])
# placeholder for masks
self.mask = tf.placeholder(tf.float32, shape=[None, self._obs_dim])
self._batch_size = tf.shape(self.x)[0]
# encode inputs (map to parameterization of diagonal Gaussian)
with tf.variable_scope('encoder'):
# the tensor F stores the learnable per-feature ID (identity) embeddings
self.F = tf.get_variable(
"F",
shape=[1, self._obs_dim, 10],
initializer=tf.contrib.layers.xavier_initializer())
self.F = tf.tile(self.F, [self._batch_size, 1, 1])
self.F = tf.reshape(self.F, [-1, 10])
self.b = tf.get_variable(
"b",
shape=[1, self._obs_dim, 1],
initializer=tf.contrib.layers.xavier_initializer())
# bias vector
self.b = tf.tile(self.b, [self._batch_size, 1, 1])
self.b = tf.reshape(self.b, [-1, 1])
self.x_aug = tf.concat(
[self.x_flat, self.x_flat * self.F, self.b], 1)
self.encoded = layers.fully_connected(self.x_aug, self._K)
self.encoded = tf.reshape(self.encoded,
[-1, self._obs_dim, self._K])
self.mask_on_hidden = tf.reshape(self.mask,
[-1, self._obs_dim, 1])
self.mask_on_hidden = tf.tile(self.mask_on_hidden,
[1, 1, self._K])
self.encoded = tf.nn.relu(
tf.reduce_sum(self.encoded * self.mask_on_hidden, 1))
self.encoded = layers.fully_connected(self.encoded, 500)
self.encoded = layers.fully_connected(self.encoded, 200)
self.encoded = layers.fully_connected(
self.encoded, 2 * self._latent_dim, activation_fn=None)
with tf.variable_scope('sampling'):
# unpacking mean and (diagonal) variance of latent variable
self.mean = self.encoded[:, :self._latent_dim]
self.logvar = self.encoded[:, self._latent_dim:]
# also calculate standard deviation for practical use
self.stddev = tf.sqrt(tf.exp(self.logvar))
# sample from latent space
epsilon = tf.random_normal(
[self._batch_size, self._latent_dim])
self.z = self.mean + self.stddev * epsilon
# decode batch
with tf.variable_scope('generator'):
self.decoded, _ = self._decode(self.z, self._obs_dim)
with tf.variable_scope('loss'):
# KL divergence between approximate posterior q and prior p
with tf.variable_scope('kl-divergence'):
self.kl = self._kl_diagnormal_stdnormal(
self.mean, self.logvar)
# loss likelihood
if self._obs_distrib == 'Bernoulli':
with tf.variable_scope('bernoulli'):
self.log_like = self._bernoulli_log_likelihood(
self.x, self.decoded, self.mask)
else:
with tf.variable_scope('gaussian'):
self.log_like = self._gaussian_log_likelihood(
self.x * self.mask, self.decoded * self.mask,
self._obs_std)
self._loss = (self.kl + self.log_like) / tf.cast(
self._batch_size, tf.float32) # loss per instance (actual loss used)
self._loss_print = (self.kl + self.log_like) / tf.reduce_sum(
self.mask) # loss per feature, for tracking training process only
with tf.variable_scope('optimizer'):
optimizer = self._optimizer(learning_rate=self._learning_rate)
with tf.variable_scope('training-step'):
self._train = optimizer.minimize(self._loss)
if self._load_model == 1:
generator_variables = []
for v in tf.trainable_variables():
if "generator" in v.name:
generator_variables.append(v)
encoder_variables = []
for v in tf.trainable_variables():
if "encoder" in v.name:
encoder_variables.append(v)
# start tensorflow session
self._sesh = tf.Session()
load_encoder = tf.contrib.framework.assign_from_checkpoint_fn(
self._encoder_path, encoder_variables)
load_encoder(self._sesh)
load_generator = tf.contrib.framework.assign_from_checkpoint_fn(
self._decoder_path, generator_variables)
load_generator(self._sesh)
uninitialized_vars = []
for var in tf.all_variables():
try:
self._sesh.run(var)
except tf.errors.FailedPreconditionError:
uninitialized_vars.append(var)
init_new_vars_op = tf.variables_initializer(uninitialized_vars)
self._sesh.run(init_new_vars_op)
else:
self._sesh = tf.Session()
init = tf.global_variables_initializer()
self._sesh.run(init)
## KL divergence
def _kl_diagnormal_stdnormal(self, mu, log_var):
'''
This function calculates KL divergence
:param mu: mean
:param log_var: log variance
:return:
'''
var = tf.exp(log_var)
kl = 0.5 * tf.reduce_sum(tf.square(mu) + var - 1. - log_var)
return kl
## likelihood terms
@staticmethod
def _bernoulli_log_likelihood(targets, outputs, mask, eps=1e-8):
'''
This function computes the negative log likelihood for a Bernoulli likelihood
:param targets: test data
:param outputs: model predictions
:param mask: mask of missingness
:return: negative log llh
'''
eps = 0
log_like = -tf.reduce_sum(targets * (tf.log(outputs + eps) * mask) +
(1. - targets) *
(tf.log((1. - outputs) + eps) * mask))
return log_like
@staticmethod
def _gaussian_log_likelihood(targets, mean, std):
'''
This function computes negative log likelihood for Gaussians during training
Note that constant terms are dropped.
:param targets: test data
:param mean: mean
:param std: sigma
:return: negative log llh
'''
se = tf.reduce_sum(
0.5 * tf.square(targets - mean) / tf.cast(tf.square(std), tf.float32) + tf.cast(tf.log(std), tf.float32))
return se
## optimization function
def update(self, x, mask):
'''
This function is used to update the model during training/optimization
:param x: training data
:param mask: mask that indicates observed data and missing data locations
'''
_, loss = self._sesh.run([self._train, self._loss_print],
feed_dict={
self.x: x,
self.mask: mask
})
return loss
##
def full_batch_loss(self, x,mask):
'''
retrieve different components of loss function
:param x: data matrix
:param mask: mask that indicates observed data and missing data locations
:return: overall loss (averaged over all entries), KL term, and reconstruction loss
'''
# mask = np.ones((x.shape[0], self._obs_dim))
loss, kl, recon = self._sesh.run(
[self._loss_print, self.kl, self.log_like],
feed_dict={
self.x: x,
self.mask: mask
})
return loss, kl, recon
## predictive likelihood and uncertainties
def predictive_loss(self, x, mask, M):
'''
This function computes predictive losses (negative llh).
This is used for active learning phase.
We assume that the last column of x is the target variable of interest
:param x: data matrix, the last column of x is the target variable of interest
:param mask: mask that indicates observed data and missing data locations
:param M: number of MC samples
:return: negative predictive log likelihood of the target (last column), and the predictive uncertainty (std of the MC samples)
'''
llh = 0
lh = 0
uncertainty_data = np.zeros((x.shape[0], M))
for m in range(M):
decoded = self._sesh.run(
self.decoded, feed_dict={
self.x: x,
self.mask: mask
})
target = x[:, -1]
output = decoded[:, -1]
uncertainty_data[:, m] = decoded[:, -1]
if self._obs_distrib == 'Bernoulli':
llh += target * (np.log(output + 1e-8)) + (1. - target) * (
np.log((1. - output) + 1e-8))
else:
lh += np.exp(-0.5 * np.square(target - output) / (
np.square(self._obs_std)) - np.log(self._obs_std) - 0.5 * np.log(2 * np.pi))
uncertainty = uncertainty_data.std(axis=1)
if self._obs_distrib == 'Bernoulli':
nllh = -llh / M
else:
nllh = -np.log(lh / M)
return nllh, uncertainty
def impute_losses(self, x, mask_obs, mask_target):
'''
This function computes imputation losses
:param x: data matrix
:param mask_obs: mask that indicates observed data and missing data locations
:param mask_target: mask that indicates the test data locations
:return: MAE and RMSE
'''
MAE = 0
RMSE = 0
for m in range(self._M):
decoded = self._sesh.run(self.decoded,
feed_dict={self.x: x, self.mask: mask_obs})
target = x * mask_target
output = decoded * mask_target
MAE += np.sum(np.abs(target - output)) / np.sum(mask_target)
RMSE += np.sqrt(np.sum(np.square(target - output)) / np.sum(mask_target))
MAE = MAE / self._M
RMSE = RMSE / self._M
return MAE, RMSE
## generate partial inference samples
def im(self, x, mask):
'''
This function produces simulations of unobserved variables based on observed ones.
:param x: data matrix
:param mask: mask that indicates observed data and missing data locations
:return: im, which contains samples of completion.
'''
m, v = self._sesh.run([self.mean, self.stddev],
feed_dict={
self.x: x,
self.mask: mask
})
ep = np.random.normal(0, 1, [x.shape[0], self._latent_dim])
z = m + v * ep
im = self._sesh.run(self.decoded, feed_dict={self.z: z})
return im
## calculate the first term of information reward approximation
def chaini_I(self, x, mask, i):
'''
calculate the first term of information reward approximation
used only in active learning phase
:param x: data
:param mask: mask of missingness
:param i: indicates the index of x_i
:return: the first term of information reward approximation
'''
temp_mask = copy.deepcopy(mask)
m, v = self._sesh.run([self.mean, self.stddev],
feed_dict={
self.x: x,
self.mask: temp_mask
})
var = v**2
log_var = 2 * np.log(v)
temp_mask[:, i] = 1
m_i, v_i = self._sesh.run([self.mean, self.stddev],
feed_dict={
self.x: x,
self.mask: temp_mask
})
var_i = v_i**2
log_var_i = 2 * np.log(v_i)
kl_i = 0.5 * np.sum(
np.square(m_i - m) / v + var_i / var - 1. - log_var_i + log_var,
axis=1)
return kl_i
## calculate the second term of information reward approximation
def chaini_II(self, x, mask, i):
'''
calculate the second term of information reward approximation
used only in active learning phase
Note that we assume that the last column of x is the target variable of interest
:param x: data
:param mask: mask of missingness
:param i: indicates the index of x_i
:return: the second term of information reward approximation
'''
# mask: represents x_o
# targets: 0 by M vector, contains M samples from p(\phi|x_o)
# x : 1 by obs_dim vector, contains 1 instance of data
# i: indicates the index of x_i
temp_mask = copy.deepcopy(mask)
temp_mask[:, -1] = 1
m, v = self._sesh.run([self.mean, self.stddev],
feed_dict={
self.x: x,
self.mask: temp_mask
})
var = v**2
log_var = 2 * np.log(v)
temp_mask[:, i] = 1
m_i, v_i = self._sesh.run([self.mean, self.stddev],
feed_dict={
self.x: x,
self.mask: temp_mask
})
var_i = v_i**2
log_var_i = 2 * np.log(v_i)
kl_i = 0.5 * np.sum(
np.square(m_i - m) / v + var_i / var - 1. - log_var_i + log_var,
axis=1)
return kl_i
## save model
def save_generator(self, path, prefix="is/generator"):
'''
This function saves generator parameters to path
'''
variables = tf.trainable_variables()
var_dict = {}
for v in variables:
if "generator" in v.name:
name = prefix + re.sub("is/generator", "", v.name)
name = re.sub(":0", "", name)
var_dict[name] = v
for k, v in var_dict.items():
print(k)
print(v)
saver = tf.train.Saver(var_dict)
saver.save(self._sesh, path)
def save_encoder(self, path, prefix="is/encoder"):
'''
This function saves encoder parameters to path
'''
variables = tf.trainable_variables()
var_dict = {}
for v in variables:
if "encoder" in v.name:
name = prefix + re.sub("is/encoder", "", v.name)
name = re.sub(":0", "", name)
var_dict[name] = v
for k, v in var_dict.items():
print(k)
print(v)
saver = tf.train.Saver(var_dict)
saver.save(self._sesh, path)
### function to generate new samples conditioned on observations
def completion(x, mask, M, vae):
'''
function to generate new samples conditioned on observations
:param x: underlying partially observed data
:param mask: mask of missingness
:param M: number of MC samples
:param vae: a pre-trained vae
:return: sampled missing data, an M by N by D matrix, where M is the number of samples.
'''
im = np.zeros((M, x.shape[0], x.shape[1]))
for m in range(M):
#tf.reset_default_graph()
np.random.seed(42 + m) ### added for bar plots only
im[m, :, :] = vae.im(x, mask)
return im
### function for computing reward function approximation
def R_lindley_chain(i, x, mask, M, vae, im, loc):
'''
function for computing reward function approximation
:param i: indicates the index of x_i
:param x: data matrix
:param mask: mask of missingness
:param M: number of MC samples
:param vae: a pre-trained vae
:param im: sampled missing data, an M by N by D matrix, where M is the number of samples.
:param loc: indices of the data points (rows) for which the reward is evaluated
:return: approximate information reward for feature i, evaluated at the rows in loc
'''
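# As implemented below, the information reward of feature i is approximated, for each imputed sample m, by
#   KL( q(z | x_o, x_i^(m))          || q(z | x_o) )            (chaini_I)
# - KL( q(z | x_o, x_i^(m), phi^(m)) || q(z | x_o, phi^(m)) )   (chaini_II)
# where phi denotes the target variable (last column); the result is averaged over the M samples.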
im_i = im[:, :, i]
#print(im_i)
approx_KL = 0
im_target = im[:, :, -1]
temp_x = copy.deepcopy(x)
for m in range(M):
temp_x[loc, i] = im_i[m, loc]
KL_I = vae.chaini_I(temp_x[loc, :], mask[loc, :], i)
temp_x[loc, -1] = im_target[m, loc]
KL_II = vae.chaini_II(temp_x[loc, :], mask[loc, :], i)
approx_KL += KL_I
approx_KL -= KL_II
R = approx_KL / M
return R

245
train_and_test_functions.py Normal file

@ -0,0 +1,245 @@
from p_vae import *
from codings import *
import numpy as np
import tensorflow as tf
from scipy.stats import bernoulli
import argparse
import os
from random import sample
#### parser configurations
parser = argparse.ArgumentParser(
description='EDDI')
parser.add_argument(
'--epochs',
type=int,
default=3000,
metavar='N_eps',
help='number of epochs to train (default: 3000)')
parser.add_argument(
'--latent_dim',
type=int,
default=10,
metavar='LD',
help='latent dimension (default: 10)')
parser.add_argument(
'--p',
type=float,
default=0.7,
metavar='probability',
help='dropout probability of artificial missingness during training')
parser.add_argument(
'--iteration',
type=int,
default=-1,
metavar='it',
help='iterations per epoch. set to -1 to run the full epoch. ')
parser.add_argument(
'--batch_size',
type=int,
default=100,
metavar='batch',
help='Mini Batch size per epoch. ')
parser.add_argument(
'--K',
type=int,
default=20,
metavar='K',
help='Dimension of PNP feature map ')
parser.add_argument(
'--M',
type=int,
default=50,
metavar='M',
help='Number of MC samples when perform imputing')
parser.add_argument(
'--output_dir',
type=str,
default=os.getenv('PT_OUTPUT_DIR', '/tmp'))
parser.add_argument(
'--data_dir',
type=str,
default=os.getenv('PT_DATA_DIR', 'data'),
help='Directory where UCI dataset is stored.')
args = parser.parse_args()
#### Set directories
UCI = args.data_dir
ENCODER_WEIGHTS = os.path.join(args.output_dir, 'encoder.tensorflow')
FINETUNED_DECODER_WEIGHTS = os.path.join(args.output_dir, 'generator.tensorflow')
rs = 42 # random seed
def train_p_vae(Data_train,mask_train, epochs, latent_dim,batch_size, p, K,iteration):
'''
This function trains the partial VAE.
:param Data_train: training Data matrix, N by D
:param mask_train: mask matrix that indicates the missingness. 1=observed, 0 = missing
:param epochs: number of epochs of training
:param latent_dim: latent dimension for partial VAE model
:param p: dropout rate for creating additional missingness during training
:param K: dimension of feature map of PNP encoder
:param iteration: how many mini-batches are used each epoch. set to -1 to run the full epoch.
:return: the trained VAE.
'''
obs_dim = Data_train.shape[1]
n_train = Data_train.shape[0]
list_train = np.arange(n_train)
####### construct
kwargs = {
'K': K,
'obs_distrib': "Gaussian",
'latent_dim': latent_dim,
'batch_size': batch_size,
'encoder': PNP_fc_uci_encoder,
'decoder': fc_uci_decoder,
'obs_dim': obs_dim,
'load_model':0,
'decoder_path': FINETUNED_DECODER_WEIGHTS,
'encoder_path': ENCODER_WEIGHTS,
}
vae = PN_Plus_VAE(**kwargs)
if iteration == -1:
n_it = int(np.ceil(n_train / kwargs['batch_size']))
else:
n_it = iteration
for epoch in range(epochs):
training_loss_full = 0.
# test_loss, test_kl, test_recon = vae.full_batch_loss(Data_test,mask_test)
# test_loss = test_loss
# test_kl = test_kl / n_test
# test_recon = test_recon / n_test
# iterate through batches
# np.random.shuffle(list_train)
for it in range(n_it):
if iteration == -1:
batch_indices = list_train[it * kwargs['batch_size'] : (it + 1) * kwargs['batch_size']]
else:
batch_indices = sample(range(n_train), kwargs['batch_size'])
x = Data_train[batch_indices, :]
mask_train_batch = mask_train[batch_indices, :]
DROPOUT_TRAIN = np.minimum(np.random.rand(x.shape[0], obs_dim), p)  # per-entry dropout probabilities for the actual mini-batch size
while True:
# mask_drop = np.array([bernoulli.rvs(1 - DROPOUT_TRAIN)] )
mask_drop = bernoulli.rvs(1 - DROPOUT_TRAIN)
if np.sum(mask_drop>0):
break
# mask_drop = mask_drop.reshape([kwargs['batch_size'], obs_dim])
_ = vae.update(x, mask_drop*mask_train_batch)
loss_full, _, _ = vae.full_batch_loss(x,mask_drop*mask_train_batch)
training_loss_full += loss_full
# average loss over most recent epoch
training_loss_full /= n_it
print(
'Epoch: {} \tnegative training ELBO per observed feature: {:.2f}'
.format(epoch, training_loss_full))
vae.save_generator(FINETUNED_DECODER_WEIGHTS)
vae.save_encoder(ENCODER_WEIGHTS)
return vae
def test_p_vae_marginal_elbo(Data_test,mask_test,latent_dim,K):
'''
This function computes the marginal (negative) ELBO of observed test data
Note that this function does not perform imputing.
:param Data_test: test data matrix
:param mask_test: mask matrix that indicates the missingness. 1=observed, 0 = missing
:param latent_dim: latent dimension for partial VAE model
:param K:dimension of feature map of PNP encoder
:return: test negative ELBO
'''
obs_dim = Data_test.shape[1]
####### construct
kwargs = {
'K': K,
'obs_distrib': "Gaussian",
'latent_dim': latent_dim,
'encoder': PNP_fc_uci_encoder,
'decoder': fc_uci_decoder,
'obs_dim': obs_dim,
'load_model':1,
'decoder_path': FINETUNED_DECODER_WEIGHTS,
'encoder_path': ENCODER_WEIGHTS,
}
vae = PN_Plus_VAE(**kwargs)
test_loss, _, _ = vae.full_batch_loss(Data_test,mask_test)
print('test negative ELBO per feature: {:.2f}'
.format(test_loss))
return test_loss
def impute_p_vae(Data_observed,mask_observed,Data_ground_truth,mask_target,latent_dim,batch_size,K,M):
'''
This function loads a pretrained p-VAE model, and performs imputation and returns RMSE.
:param Data_observed: observed data matrix
:param mask_observed: mask matrix that indicates the missingness of observed data. 1=observed, 0 = missing
:param Data_ground_truth: ground truth data for calculating RMSE performance. It can also contain missing data.
:param mask_target: mask matrix that indicates the missingness of ground truth data. 1=observed, 0 = missing
:param latent_dim: latent dimension for partial VAE model
:param batch_size: Mini-batch size. We evaluate test RMSE in batch mode in order to handle large dataset.
:param K: dimension of feature map of PNP encoder
:param M: number of samples used for MC sampling
:return: RMSE
'''
obs_dim = Data_observed.shape[1]
####### construct
n_data = Data_observed.shape[0]
# mask_test_obs = 1 * (Data_observed != missing_value_indicator) # mask of test observed entries
# mask_test_target = 1 * (Data_ground_truth != missing_value_indicator) # mask of test missingness to be imputed
mask_test_obs = mask_observed
mask_test_target = mask_target
kwargs = {
'K': K,
'obs_distrib': "Gaussian",
'latent_dim': latent_dim,
'batch_size': batch_size,
'encoder': PNP_fc_uci_encoder,
'decoder': fc_uci_decoder,
'obs_dim': obs_dim,
'load_model': 1,
'decoder_path': FINETUNED_DECODER_WEIGHTS,
'encoder_path': ENCODER_WEIGHTS,
'M': M,
}
vae = PN_Plus_VAE(**kwargs)
impute_loss_RMSE = 0.
list_data = np.arange(n_data)
# np.random.shuffle(list_data)
n_it = int(np.ceil(n_data / kwargs['batch_size']))
# iterate through batches
for it in range(n_it):
batch_indices = list_data[it * kwargs['batch_size'] : (it + 1) * kwargs['batch_size']]
_, impute_loss_RMSE_batch = vae.impute_losses(Data_ground_truth[batch_indices, :], mask_test_obs[batch_indices, :],
mask_test_target[batch_indices, :])
impute_loss_RMSE += impute_loss_RMSE_batch
impute_loss_RMSE /= (n_it)
print('test impute RMSE: {:.2f}'
.format(impute_loss_RMSE))
return impute_loss_RMSE