This commit is contained in:
Mark Hamilton 2020-07-22 23:04:21 -04:00
Parent 0647918785
Commit dd1d8d3170
25 changed files: 201 additions and 1440 deletions

View file

@ -1,19 +0,0 @@
FROM tensorflow/tensorflow:1.13.2-gpu
ARG SPARK_VERSION=2.4.3
RUN apt-get -qq update && apt-get -qq -y install curl bzip2 gcc \
&& curl -sSL https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh \
&& bash /tmp/miniconda.sh -bfp /usr/local \
&& rm -rf /tmp/miniconda.sh \
&& conda install -y python=3 \
&& conda update conda \
&& apt-get install openjdk-8-jre-headless -y \
&& conda install pyspark=${SPARK_VERSION} \
&& apt-get -qq -y remove curl bzip2 \
&& apt-get -qq -y autoremove \
&& apt-get autoclean \
&& rm -rf /var/lib/apt/lists/* /var/log/dpkg.log \
&& conda clean --all --yes
# Miniconda is installed with prefix /usr/local above, so its bin directory (not /opt/conda) goes on PATH
ENV PATH /usr/local/bin:$PATH
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64

View file

@ -1,59 +1,48 @@
## For Evaluation
Evaluation notebook on Databricks takes two pickle files as input:
# Mosaic
**metadata.pkl**: list of (content, style) tuples
## About
**features-{model}.pkl**: 2D numpy array (data points x length of feature vector) of feature vectors
## Architecture
`features.py` currently writes out both `.pkl` files for the ResNet models.
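A minimal sketch of consuming these two files (`resnet50` as the model name and the cosine-similarity check are illustrative):
```python
import pickle
import numpy as np

with open("metadata.pkl", "rb") as f:
    metadata = pickle.load(f)  # list of (content, style) tuples

with open("features-resnet50.pkl", "rb") as f:
    features = pickle.load(f)  # 2D array: data points x feature-vector length

assert features.shape[0] == len(metadata)
a, b = features[0], features[1]
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity
```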
## Paper
Work from day one of the hackathon; deploys the model onto AzureML so it can be run as a web service.
## Building from Scratch
## Contents
### Backend
File/Folder: Description
+ 'azureml': All code is here
+ '.vs'
+ '.vscode'
+ 'call.py': Regression model, run to simulate a call to the model
+ 'call2.py': Resnet50 model, run to simulate a call to the model **(CURRENTLY NOT WORKING)**
+ 'deploy.py': Regression model, deploys the model to AzureML
+ 'deploy2.py': Resnet50 model, deploys the model to AzureML **(CURRENTLY NOT WORKING)**
+ 'deploymentConfig.yml': configuration details for deploying the model to AzureML
+ 'my_model.h5': Model of Resnet50, may need modification
+ 'myenv.yml': details about the python environment
+ 'panda.jpg': image of a panda used for testing
+ 'score.py': Regression model, handles the actual calculations when called from call.py
+ 'score2.py': Resnet model, handles the actual classification when called from call2.py
+ 'sklearn_regression_model.pkl': Regression model
+ 'gitignore': What to ignore at commit time
+ 'CODE_OF_CONDUCT': Code of conduct
+ 'LICENSE': The license for the files
+ 'README': This README file
+ 'SECURITY': Security information
1. Download Image Metadata:
```bash
# quote the URL so the shell does not treat '&' as the background operator
wget "https://mmlsparkdemo.blob.core.windows.net/cknn/metadata.json?sv=2019-02-02&st=2020-07-23T02%3A22%3A30Z&se=2023-07-24T02%3A22%3A00Z&sr=b&sp=r&sig=hDnGw9y%2BO5XlggL6br%2FPzSKmpAdUZ%2F1LJKVkcmbVmCE%3D" -O metadata.json
```
1. Download Images:
```bash
cd data_prep
python download_images.py
```
1. Featurize and perform Conditional Image Retrieval on every image
```bash
cd data_prep
python featurize_and_match.py
```
1. Write enriched information to an Azure Search Index.
More detailed code is coming soon; follow [the closely related guide](
https://docs.microsoft.com/en-us/azure/cognitive-services/big-data/recipes/art-explorer) for a similar example.
## Prerequisites
### Frontend
Visual Studio/Visual Studio Code (if you use Visual Studio Code to set up your Python environment, use Ctrl+Shift+P, "Python: Select Interpreter") \
Anaconda 3.7 \
Azure Machine Learning workspace \
Azure Machine Learning SDK (https://docs.microsoft.com/en-us/python/api/overview/azure/ml/install?view=azure-ml-py) \
Docker \
Tensorflow \
Keras
1. Install `npm` if you don't already have it. You can find instructions at [https://nodejs.org/](https://nodejs.org/).
1. Install dependencies:
```bash
cd frontend
npm install
```
1. Start the development server:
```bash
npm start
```
1. Navigate to [http://localhost:3000/art](http://localhost:3000/art) to explore the local website.
## Setup
Run either `deploy.py` or `deploy2.py` once, depending on which model you are using, to deploy that model to the target.
## Running the sample
Run `call.py` or `call2.py`, depending on which model you are using, to simulate a call to the AzureML model.
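For example (a minimal sketch; run from the folder that contains the scripts, and use the `*2` variants for the ResNet50 model):
```bash
python deploy.py  # one-time deployment of the model to AzureML
python call.py    # simulate a client call to the deployed web service
```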
## Key concepts
The general idea is that we are deploying a machine learning model to the cloud as a web service. The regression model was mostly a proof of concept; the ResNet50 model is closer to what we will end up using, since we will be handling images.
## Contributing

View file

@ -1,81 +0,0 @@
# Model Training and Inference
- [Model Training and Inference](#model-training-and-inference)
- [File Structure](#file-structure)
- [Getting Started](#getting-started)
- [Install the Python Dependencies](#install-the-python-dependencies)
- [Deploying Featurization Script](#deploying-featurization-script)
- [Training](#training)
- [Service Deployment](#service-deployment)
Mosaic allows users to find similar artworks by featurizing artwork images with a pretrained Keras model, normalizing the resulting vectors, and loading them into a ball tree that can be queried quickly for similar artwork, filtered by either culture or classification (medium).
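A minimal sketch of that pipeline, using the same ResNet50 featurizer that `featurize.py` and `score.py` use (the `featurize` helper is illustrative; the ball-tree step is sketched under Training below):
```python
import numpy as np
from PIL import Image
from keras.applications.resnet50 import ResNet50, preprocess_input

# pretrained featurizer: global-average-pooled ResNet50 embeddings
model = ResNet50(input_shape=[225, 225, 3], weights="imagenet",
                 include_top=False, pooling="avg")

def featurize(path):
    img = Image.open(path).convert("RGB").resize((225, 225))
    x = preprocess_input(np.expand_dims(np.array(img), axis=0).astype(np.float32))
    feature = model.predict(x)[0]
    return feature / np.linalg.norm(feature)  # unit norm: inner product == cosine similarity
```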
## File Structure
This folder contains various scripts and configuration files that either are deployed on Azure or automate the deployment process.
- `featurize.py` is deployed to Azure Machine Learning as an experiment to read image metadata from a mounted Azure Storage blob, download the images, featurize the images, and save them into a ball tree in the file system.
- `deploy_featurize.py` automates the deployment of `featurize.py` by mounting the storage blob, spinning up a GPU cluster, and running the experiment on the cluster. Once the experiment is complete, it registers the entire `outputs/` folder as a model named `mosaic_model`.
- `score.py` is deployed to Azure Machine Learning as a web service that allows clients to query the model. An initialization function `init()` loads the model (trained by `featurize.py`) from disk and optionally asserts that Tensorflow is able to detect a GPU. In `run(request)`, we receive an `AMLRequest` object where we can read the entire request object (HTTP method, query params, etc.) to determine the inputs for running model inference. We query the model and return the list of similar artwork filtered by the request parameters.
- `deploy_score_local.py` runs an instance of `score.py` for local debugging. It builds the Docker image and saves it to the local Docker image store. It then attempts to run the Docker container and, if successful, gives the user a URL for the inference server. Runtime ranges from 5 to 20 minutes.
- `deploy_score_aks.py` runs an instance of `score.py` in an AKS cluster. It attempts to attach to a cluster and service if already running, otherwise it creates a service on an existing or new cluster. It then deploys the model and script onto the cluster. Runtime ranges from 10-20 minutes.
- `./GPU_Docker/Dockerfile` is a Dockerfile that specifies how to build the base image for training and scoring. It includes `tensorflow-gpu` for GPU drivers, `Java` for `pyspark`, and an installation of `Anaconda`. This Dockerfile has been built and hosted on [DockerHub](https://hub.docker.com/repository/docker/typingkoala/mosaic_base_image) in the repo `typingkoala/mosaic_base_image`.
- `call_service.py` is a script that makes a POST request to our web service and prints the response.
## Getting Started
In order to deploy Mosaic, you will need the following installed on your computer.
- Python 3
- Docker
### Install the Python Dependencies
First, install the AzureML Python SDK. Make sure to activate your virtual environment if you are using one.
```bash
pip install --upgrade azureml-sdk
```
### Deploying Featurization Script
In order to begin online featurization, we first edit the `deploy_featurize.py` script with the appropriate workspace and Azure Storage information. On the first run, you will be prompted to log in to Microsoft using interactive authentication. Once completed, your authentication information will be cached locally for future runs.
```bash
python azureml/deploy_featurize.py
```
Running `deploy_featurize.py` will attach to a cluster (or create one if it doesn't exist) with the name specified in the script. It will then submit the `featurize.py` script as a job to complete. Logs will stream from the cluster to the local terminal. Once the script runs, the `outputs/` folder will be registered as a model so that it can be mounted to the inference cluster for serving web traffic.
## Training
The Ball Tree API originates from [MMLSpark](https://github.com/Azure/mmlspark). It allows for the initialization of a conditional ball tree with three methods: `findMaximumInnerProducts`, `save`, and `load`.
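A sketch of that API as it is used in this repo (the constructor arguments mirror the commented-out code in `featurize.py`; the toy data and exact signature are assumptions):
```python
from mmlspark.nn.ConditionalBallTree import ConditionalBallTree

features = [[1.0, 0.0], [0.8, 0.6]]   # unit-normalized feature vectors
ids = ["art_1", "art_2"]              # value returned for each match
labels = ["french", "dutch"]          # conditioning label per item
cbt = ConditionalBallTree(features, ids, labels, 50)  # 50 = leaf size

# top-1 inner-product match among items labeled 'french'
matches = cbt.findMaximumInnerProducts([1.0, 0.0], {"french"}, 1)

cbt.save("culture.ball")
cbt = ConditionalBallTree.load("culture.ball")
```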
The featurization of the images and the creation of the ball trees is done in `featurize.py`. The file reads a TSV of metadata from a mounted storage blob and downloads the images from the provided URLs. The images are featurized using ResNet50 embeddings, which are then used to create the ball-tree objects.
The training is run through `deploy_featurize.py` as an experiment on Azure Machine Learning (AML). It mounts the storage blob for `featurize.py`, submits the run, then saves the balltree objects and metadata in a model to be referenced later.
This can be run either through AML training clusters or locally to speed up the dev loop. Make sure the [workspace settings](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) are correct before running. The cluster settings, such as `vm_size` and the number of nodes, can be altered in `provisioning_config`. Setting `min_nodes=0` allows the cluster to scale down to zero nodes when not in use. To run locally, the container can be downloaded from Azure Container Registry, but only after you run it through AML. The repository URL can be found through Azure Container Registry and will resemble `extenamls.azurecr.io/azureml/azureml_0062a8f080ece0d27d:latest`.
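For example, to fetch the image (a sketch; the registry name `extenamls` is inferred from the URL above and assumes Azure CLI access to that registry):
```bash
az acr login --name extenamls
docker pull extenamls.azurecr.io/azureml/azureml_0062a8f080ece0d27d:latest
```
The container can then be run with a local bind mount: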
```bash
docker run -d -it --name <name> --mount type=bind,source=<source_directory>,target=/app <repository_url>
```
The `docker exec` command enables debugging through a bash terminal inside the container:
```bash
docker exec -it <name> bash
```
## Service Deployment
`score.py` is a web service that allows clients to query our model. It handles POST requests whose JSON body contains `url`, `n`, and an optional `query` naming a culture or classification. It loads the ball trees and metadata pickle created in `featurize.py`, then downloads the image at the provided URL and featurizes it. The featurized image is queried against either the culture or classification ball tree with the desired number of results, returning the closest matches. The metadata for the results is then sent back as a serialized JSON object.
The web service is deployed through `deploy_score_aks.py` to an inference cluster on Azure Kubernetes Service. It first tries to update an existing service; if that fails, it creates either a new service or a new cluster and service.
The service can be deployed to a cluster or locally. Make sure the [workspace settings](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) are correct before running. The settings for the inference cluster can be changed in `gpu_aks_config`. To deploy locally, run `deploy_score_local.py`.

View file

@ -1,15 +0,0 @@
import requests

resp = requests.post(
    "https://extern2020apim.azure-api.net/cknn/",
    json={
        "url": "https://mmlsparkdemo.blob.core.windows.net/rijks/resized_images/AK-BR-324.jpg",
        "n": 5,
        "query": "prints"})
print(resp.text)
response_data = resp.json()
if "results" not in response_data:
    raise Exception("FAILED: No results field.")

View file

@ -1,62 +0,0 @@
import os
from azureml.core import Datastore
from azureml.core import Experiment, Workspace
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.compute.amlcompute import AmlComputeProvisioningConfiguration
from azureml.train.estimator import Estimator
ws = Workspace(
subscription_id="f9b96b36-1f5e-4021-8959-51527e26e6d3",
resource_group="marhamil-mosaic",
workspace_name="mosaic-aml"
)
datastore = Datastore.register_azure_blob_container(
workspace=ws,
datastore_name='mosaic_datastore',
container_name='mosaic',
account_name='mmlsparkdemo',
sas_token="?sv=2019-02-02&ss=bf&srt=sco&sp=rlc&se=2030-01-23T04:14:29Z"
"&st=2020-01-22T20:14:29Z&spr=https,http&sig=nPlKziG9ppu4Vt5"
"b6G%2BW1JkxHYZ1dlm39mO2fMZlET4%3D",
create_if_not_exists=True)
cluster_name = "training-4"
try:
# Connecting to pre-existing cluster
compute_target = ComputeTarget(ws, cluster_name)
print("Found existing cluster...")
except ComputeTargetException:  # no existing cluster with this name
# Create a new cluster to train on
provisioning_config = AmlComputeProvisioningConfiguration(
vm_size="Standard_D4_v2",
min_nodes=0,
max_nodes=1
)
compute_target = ComputeTarget.create(ws, cluster_name, provisioning_config)
compute_target.wait_for_completion(show_output=True)
# Create and run the experiment
exp = Experiment(workspace=ws, name='featurize_artwork_4')
estimator = Estimator(
source_directory=".",
entry_script="featurize.py",
script_params={
"--data-dir": datastore.as_mount()
},
conda_dependencies_file=os.path.join(os.path.dirname(os.path.realpath(__file__)), "myenv.yml"),
use_docker=True,
custom_docker_image="typingkoala/mosaic_base_image:1.0.0",
compute_target=compute_target
)
run = exp.submit(estimator)
run.wait_for_completion(show_output=True)
# Save the ball trees and metadata produced by featurize.py
run.register_model(
model_name="mosaic_model_4",
model_path="outputs/"
)

View file

@ -1,28 +0,0 @@
from azureml.core import Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import LocalWebservice
# Set your workspace
ws = Workspace(
subscription_id="ce1dee05-8cf6-4ad6-990a-9c80868800ba",
resource_group="extern2020",
workspace_name="exten-amls"
)
# Settings for deployment
inference_config = InferenceConfig(
entry_script="rationale.py",
runtime="python",
source_directory=".",
conda_file="myenv.yml",
base_image="typingkoala/mosaic_base_image:1.0.0")
# Load existing model
model = Model(ws, name="mosaic_model")
# Deploy model locally
deployment_config = LocalWebservice.deploy_configuration(port=8890)
service = Model.deploy(ws, "rationalizing", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.state)

View file

@ -1,76 +0,0 @@
from azureml.core import Workspace
from azureml.core.compute import AksCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AksWebservice
from azureml.exceptions import WebserviceException
ws = Workspace(
subscription_id="f9b96b36-1f5e-4021-8959-51527e26e6d3",
resource_group="marhamil-mosaic",
workspace_name="mosaic-aml"
)
inference_config = InferenceConfig(
entry_script="score.py",
runtime="python",
source_directory=".",
conda_file="myenv.yml",
base_image="typingkoala/mosaic_base_image:1.0.0")
resource_group = 'extern2020'
cluster_name = 'aks-gpu'
service_name = 'artgpuservice'
"""
Creates a cluster if one by the name of cluster_name does not already exist.
Deploys a service to the cluster if one by the name of service_name does not already exist, otherwise it will update the existing service.
"""
try: # If cluster and service exists
aks_target = AksCompute(ws, cluster_name)
service = AksWebservice(name=service_name, workspace=ws)
# print(service.get_logs(num_lines=5000))
print("Updating existing service: {}".format(service_name))
service.update(inference_config=inference_config, auth_enabled=False)
service.wait_for_deployment(show_output=True)
except WebserviceException: # If cluster but no service
# Creating a new service
aks_target = AksCompute(ws, cluster_name)
print("Deploying new service: {}".format(service_name))
gpu_aks_config = AksWebservice.deploy_configuration(
autoscale_enabled=False,
num_replicas=1,
cpu_cores=2,
memory_gb=16,
auth_enabled=False)
service = Model.deploy(ws, service_name, [], inference_config, gpu_aks_config, aks_target, overwrite=True)
service.wait_for_deployment(show_output=True)
except ComputeTargetException: # If cluster doesn't exist
print("Creating new cluster: {}".format(cluster_name))
# Provision AKS cluster with GPU machine
prov_config = AksCompute.provisioning_configuration(
vm_size="Standard_NC6",
cluster_purpose=AksCompute.ClusterPurpose.DEV_TEST)
# Create the cluster
aks_target = ComputeTarget.create(
workspace=ws, name=cluster_name, provisioning_configuration=prov_config,
)
aks_target.wait_for_completion(show_output=True)
print("Deploying new service: {}".format(service_name))
gpu_aks_config = AksWebservice.deploy_configuration(
autoscale_enabled=False,
num_replicas=1,
cpu_cores=2,
memory_gb=16,
auth_enabled=False)
service = Model.deploy(ws, service_name, [], inference_config, gpu_aks_config, aks_target, overwrite=True)
service.wait_for_deployment(show_output=True)
print("State: " + service.state)
print("Scoring URI: " + service.scoring_uri)

View file

@ -1,28 +0,0 @@
from azureml.core import Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import LocalWebservice
# Set your workspace
ws = Workspace(
subscription_id="ce1dee05-8cf6-4ad6-990a-9c80868800ba",
resource_group="extern2020",
workspace_name="exten-amls"
)
# Settings for deployment
inference_config = InferenceConfig(
entry_script="score.py",
runtime="python",
source_directory=".",
conda_file="myenv.yml",
base_image="typingkoala/mosaic_base_image:1.0.0")
# Load existing model
model = Model(ws, name="mosaic_model")
# Deploy model locally
deployment_config = LocalWebservice.deploy_configuration(port=8890)
service = Model.deploy(ws, "scoring", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.state)

View file

@ -1,316 +0,0 @@
import argparse
import os
import pickle
import urllib.request
from azureml.core import Run, Workspace
from multiprocessing import Pool
import numpy as np
import pandas as pd
import tensorflow as tf
from PIL import Image
from keras.applications.resnet50 import ResNet50
from keras.applications.resnet50 import preprocess_input
from pyspark.sql import SparkSession
from pyspark import SparkContext, SQLContext
# Initialize
batch_size = 512
img_width = 225
img_height = 225
model = "resnet"
os.environ["CUDA_VISIBLE_DEVICES"] = str(0)
# gets mount location from aml_feat.py, passed in as args
parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", type=str, dest="data_folder")
data_folder = parser.parse_args().data_folder
tsv_path = os.path.join(data_folder, "met_rijks_metadata.tsv")
# create file paths for saving balltree later
output_root = './outputs'
features_culture_fn = os.path.join(output_root, 'features_culture.ball')
features_classification_fn = os.path.join(output_root, 'features_classification.ball')
metadata_fn = os.path.join(output_root, 'metadata.pkl')
#cached_features_url = "https://mmlsparkdemo.blob.core.windows.net/mosaic/features_and_successes.pkl"
#cached_features_fn = 'features_and_successes.pkl'
cached_features_url = "https://mmlsparkdemo.blob.core.windows.net/mosaic/met_and_rijks_art_with_features.parquet.zip"
cached_features_fn = "met_and_rijks_art_with_features.parquet.zip"
parquet_fn = "met_and_rijks_art_with_features.parquet"
write_to_index = False
# downloading java dependencies
print(os.environ.get("JAVA_HOME", "WARN: No Java home found"))
spark = SparkSession.builder \
.master("local[*]") \
.appName("TestConditionalBallTree") \
.config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1-38-a6970b95-SNAPSHOT") \
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
.config("spark.executor.heartbeatInterval", "60s") \
.config("spark.driver.memory", "32g") \
.config("spark.driver.maxResultSize", "8g") \
.getOrCreate()
from mmlspark.nn.ConditionalBallTree import ConditionalBallTree
# Featurize
def batch(iterable, n):
"""
Splits iterable into nested array with inner size n
"""
current_batch = []
for item in iterable:
if item is not None:
current_batch.append(item)
if len(current_batch) == n:
yield current_batch
current_batch = []
if current_batch:
yield current_batch
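# split the DataFrame into n_cores chunks and apply func to each chunk in its own process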
def parallel_apply(df, func, n_cores=16):
df_split = np.array_split(df, n_cores)
pool = Pool(n_cores)
df = pd.concat(pool.map(func, df_split))
pool.close()
pool.join()
return df
def retry(func, args, times):
    if times == 0:
        return False
    try:
        func(args)
        return True
    except Exception as e:
        print(e)
        return retry(func, args, times - 1)  # return the result so callers can tell whether any attempt succeeded
def download_image_inner(metadata_row):
"""
Download the image at the row's Thumbnail_Url and save it to disk as images/{museum}_{name},
where {name} is the last path segment of the URL. Raises on failure; retries are handled by retry().
"""
url = str(metadata_row["Thumbnail_Url"])
museum = str(metadata_row["Museum"])
local_file = "images/" + museum + "_" + url.split("/")[-1]
if not os.path.exists(local_file):
urllib.request.urlretrieve(url, local_file)
def download_image(metadata_row):
return retry(download_image_inner, metadata_row, 3)
def download_image_df(df):
df["Success"] = df.apply(download_image, axis=1)
return df
def load_images(rows, successes):
"""Given an array of images, return a numpy array of PIL image data
after reading the image's filename from disk.
If successful, append the image dict (with metadata) to the
provided metadata list, respectively.
Arguments:
images {dict[]} -- array of dictionaries that represent images of artwork
metadata {dict[]} -- array of image objects to append to if reading is successful
Returns:
Image[] -- an array of Pillow images
"""
batch = []
for i, row in rows:
filename = "images/" + row["Museum"] + "_" + row["Thumbnail_Url"].split("/")[-1]
try:
batch.append(load_image(filename))
successes.append(row)
except Exception as e:
print(e)
print("Failed to load image: " + filename)
return np.array(batch)
def load_image(filename):
"""
Given a filename of a museum image, make sure that it is an RGB image, then resize and preprocess the image.
Arguments:
filename {str} -- filename of the image to preprocess
Returns:
img {Image}-- an Image object that has been turned into an RGB image, resized, and preprocessed
"""
img = Image.open(filename)
# non RGB images won't have the right number of channels
if img.mode != 'RGB':
img = img.convert('RGB')
# re-size, expand dims and run through the ResNet50 model
img = np.array(img.resize((img_width, img_height)))
img = preprocess_input(img.astype(np.float32))
return img
def assert_gpu():
"""
This function will raise an exception if a GPU is not available to tensorflow.
"""
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
if not os.path.exists(tsv_path):
urllib.request.urlretrieve("https://mmlsparkdemo.blob.core.windows.net/mosaic/met_rijks_metadata.tsv",
tsv_path)
metadata = pd.read_csv(tsv_path, delimiter="\t", keep_default_na=False)
metadata = metadata.fillna('')  # replace nan values with empty string; fillna returns a copy
if write_to_index:
ws = Workspace(
subscription_id="ce1dee05-8cf6-4ad6-990a-9c80868800ba",
resource_group="extern2020",
workspace_name="exten-amls"
)
keyvault = ws.get_default_keyvault()
run = Run.get_context()
subscription_key = keyvault.get_secret(name="subscriptionKey")
image_subscription_key = keyvault.get_secret(name="imageSubscriptionKey")
from mmlspark.cognitive import AnalyzeImage
from mmlspark.stages import SelectColumns
import base64
def url_encode_id(idval):
return base64.b64encode(bytes(idval, "UTF-8")).decode("utf-8")
describeImage = (AnalyzeImage()
.setSubscriptionKey(image_subscription_key)
.setLocation("eastus")
.setImageUrlCol("Thumbnail_Url")
.setOutputCol("RawImageDescription")
.setErrorCol("Errors")
.setVisualFeatures(["Categories", "Tags", "Description", "Faces", "ImageType", "Color", "Adult"])
.setConcurrency(5))
df = spark.createDataFrame(metadata)
df2 = describeImage.transform(df) \
.select("*", "RawImageDescription.*").drop("Errors", "RawImageDescription").cache()
df2.coalesce(3).writeToAzureSearch(
subscriptionKey=subscription_key,
actionCol="searchAction",
serviceName="extern-search",
indexName="merged-art-search-6",
keyCol="id",
batchSize="1000"
)
if cached_features_url is not None:
if not os.path.exists(cached_features_fn):
urllib.request.urlretrieve(cached_features_url, cached_features_fn)
if not os.path.exists(parquet_fn):
print("extracting")
from zipfile import ZipFile
with ZipFile(cached_features_fn, 'r') as zipObj:
zipObj.extractall()
print("done extracting")
#with open(cached_features_fn, "rb") as f:
# [features, successes] = pickle.load(f)
#print("Loaded cached features")
else:
# create directory for downloading images, then download images simultaneously
print("Downloading images...")
os.makedirs("images", exist_ok=True)
metadata = parallel_apply(metadata, download_image_df)
metadata = metadata[metadata["Success"].fillna(False)] # filters out unsuccessful rows
batches = list(batch(metadata.iterrows(), batch_size))
successes = [] # clear metadata, images are only here if they are loaded from disk
data_iterator = (load_images(batch, successes) for batch in batches)
# featurize the images then normalize them
keras_model = ResNet50(
input_shape=[img_width, img_height, 3],
weights='imagenet',
include_top=False,
pooling='avg'
)
assert_gpu() # raises exception if gpu is not available
features = keras_model.predict_generator(data_iterator, steps=len(batches), verbose=1)
features /= np.linalg.norm(features, axis=1).reshape(len(successes), 1)
with open(cached_features_fn, "wb+") as f:
pickle.dump([features, successes], f)
# print(features.shape)
#
# from py4j.java_collections import ListConverter
#
# # convert to list and then create the two balltrees for culture and classification(medium)
# ids = [row["id"] for row in successes]
# features = features.tolist()
# print("fitting culture ball tree")
#
# converter = ListConverter()
# gc = SparkContext._active_spark_context._jvm._gateway_client
# from pyspark.ml.linalg import Vectors, VectorUDT
#
# java_features = converter.convert(features,gc)
# java_cultures = converter.convert([row["Culture"] for row in successes], gc)
# java_classifications = converter.convert([row["Classification"] for row in successes], gc)
# java_values = converter.convert(ids, gc)
#
# cbt_culture = ConditionalBallTree(java_features, java_values, java_cultures, 50)
# print("fitting class ball tree")
#
# cbt_classification = ConditionalBallTree(java_features, java_values, java_classifications, 50)
# print("fit culture ball tree")
#
# # save the balltrees to output directory and pickle the museum and id metadata
# os.makedirs(output_root, exist_ok=True)
# cbt_culture.save(features_culture_fn)
# cbt_classification.save(features_classification_fn)
# pickle.dump(successes, open(metadata_fn, 'wb+'))
from mmlspark.nn import *
df = spark.read.parquet(parquet_fn)
cols_to_group = df.columns
cols_to_group.remove("Norm_Features")
from pyspark.sql.functions import struct
df2 = df.withColumn("Meta", struct(*cols_to_group))
cknn_classification = (ConditionalKNN()
.setOutputCol("Matches")
.setFeaturesCol("Norm_Features")
.setValuesCol("Meta")
.setLabelCol("Classification")
.fit(df2))
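# wrap the fitted model's underlying Java ball tree so it can be saved with ConditionalBallTree.save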
cbt_classification = ConditionalBallTree(None, None, None, None, cknn_classification._java_obj.getBallTree())
cknn_culture = (ConditionalKNN()
.setOutputCol("Matches")
.setFeaturesCol("Norm_Features")
.setValuesCol("Meta")
.setLabelCol("Culture")
.fit(df2))
cbt_culture = ConditionalBallTree(None, None, None, None, cknn_culture._java_obj.getBallTree())
os.makedirs(output_root, exist_ok=True)
cbt_culture.save(features_culture_fn)
cbt_classification.save(features_classification_fn)

View file

@ -1,11 +0,0 @@
name: project_environment
dependencies:
- python=3.6.2
- tensorflow-gpu
- numpy
- keras
- Pillow
- pyspark
- pip:
- azureml-defaults[services]
- tqdm

View file

@ -1,160 +0,0 @@
import base64
import json
import os
import random
import sys
import traceback
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

import matplotlib.pyplot as plt
import numpy as np
import requests
import shap
import tensorflow as tf
from PIL import Image
from scipy.ndimage.filters import gaussian_filter
from skimage.io import imread

from azureml.contrib.services.aml_request import rawhttp
from azureml.contrib.services.aml_response import AMLResponse

import keras
import keras.backend as K
from keras import Model  # shadows azureml.core.model.Model, which this script does not use
from keras.applications.resnet50 import ResNet50, preprocess_input
from keras.layers import Input, Lambda
from keras.preprocessing import image
from pyspark.sql import SparkSession
def prep_image(url, preprocess=False):
response = requests.get(url)
img = Image.open(BytesIO(response.content))
if img.mode != 'RGB':
img = img.convert('RGB')
x = np.array(img.resize((224, 224)))
x = np.expand_dims(x, axis=0)
if preprocess:
return preprocess_input(x), img.size
return x, img.size
def setup_model():
keras_model = ResNet50(input_shape=[224, 224, 3], weights='imagenet', include_top=False, pooling='avg')
im1 = Input([224, 224, 3])
f1 = keras_model(im1)
return keras_model, im1, f1
def inv_logit(y):
return tf.math.log(y/(1-y))
def precompute(original_url):
training_img, training_img_size = prep_image(original_url)
x_train = np.array(gaussian_filter(training_img, sigma=4))
query_img, query_img_size = prep_image(original_url, True)  # the original image, preprocessed, is the query
im2_const = tf.constant(query_img, dtype=tf.float32)
im2 = Lambda(lambda im1: im2_const)(im1)
f2 = keras_model(im2)
d = keras.layers.Dot(1, normalize=True)([f1, f2])
logit = Lambda(lambda d: inv_logit((d+1)/2))(d)
model = Model(inputs=[im1], outputs=[logit])
e = shap.DeepExplainer(model, x_train)
return e
def test_match(match_url, e):
test_image, size = prep_image(match_url)
x_test = np.array(test_image)
shap_values = e.shap_values(x_test, check_additivity=False)
shap_values_normed = np.array(shap_values)
shap_values_normed = np.linalg.norm(shap_values_normed, axis=4)
blurred = gaussian_filter(shap_values_normed[0], sigma=4)
bflat = blurred.flatten()
shap_values_mask_qi = np.where(np.array(blurred) > np.mean(bflat) + np.std(bflat), 1, 0).reshape(224, 224, 1)
shap_values_qi = np.multiply(shap_values_mask_qi, x_test[0])
new_size = (224, int(size[1]/size[0]*224)) if size[0] > size[1] else (int(size[0]/size[1]*224), 224)
original_size = Image.fromarray(shap_values_qi.astype(np.uint8), 'RGB').resize(new_size)
return original_size
# # run before any queries
# keras_model, im1, f1 = setup_model()
# # run only once for each original image
# original_url = "https://mmlsparkdemo.blob.core.windows.net/cknn/datasets/interpret/lex1.jpg" # replace with link to original image
# e = precompute(original_url)
# # run for each new matched image
# match_url = "https://mmlsparkdemo.blob.core.windows.net/cknn/datasets/interpret/lex2.jpg" # replace with link to matched image
# explained_pic = test_match(match_url, e)
def assert_gpu():
"""
This function will raise an exception if a GPU is not available to tensorflow.
"""
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
def init():
global keras_model, im1, f1
keras_model, im1, f1 = setup_model()
print('model initialized')
def error_response(err_msg):
"""Returns an error response for a given error message
Arguments:
err_msg {str} -- error message
Returns:
AMLResponse -- response object for the error
"""
resp = AMLResponse(json.dumps({"error": err_msg}), 400)
resp.headers['Access-Control-Allow-Origin'] = "*"
resp.headers['Content-Type'] = "application/json"
return resp
def success_response(content):
"""Returns a success response with the given content
Arguments:
content {any} -- any json serializable data type to send to
the client
Returns:
AMLResponse -- response object for the success
"""
resp = AMLResponse(json.dumps({"results": content}), 200)
resp.headers['Access-Control-Allow-Origin'] = "*"
resp.headers['Content-Type'] = "application/json"
return resp
@rawhttp
def run(request):
    global e
    print(request)
    if request.method == 'POST':
        try:
            request_data = json.loads(request.data.decode('utf-8'))
            if request_data.get('original'):  # a new original image triggers recomputing the explainer
                e = precompute(request_data['original'])
            explained_pic = test_match(request_data['match'], e)
            # PIL images are not JSON serializable, so return the explanation as a base64-encoded PNG
            buffer = BytesIO()
            explained_pic.save(buffer, format="PNG")
            return success_response(base64.b64encode(buffer.getvalue()).decode("utf-8"))
        except Exception as err:
            traceback.print_exc()
            return error_response(str(err))
    else:  # unsupported http method
        return error_response("invalid http request method")

View file

@ -1,205 +0,0 @@
import json
import os
import traceback
from io import BytesIO

import numpy as np
import requests
import tensorflow as tf
from PIL import Image
from azureml.contrib.services.aml_request import rawhttp
from azureml.contrib.services.aml_response import AMLResponse
from keras.applications.resnet50 import ResNet50, preprocess_input
from pyspark.sql import SparkSession
ALL_CLASSIFICATIONS = {'prints', 'drawings', 'ceramics', 'textiles', 'paintings', 'accessories', 'photographs', "glass",
"metalwork", "sculptures", "weapons", "stone", "precious", "paper", "woodwork", "leatherwork",
"musical instruments", "uncategorized"}
ALL_CULTURES = {'african (general)', 'american', 'ancient american', 'ancient asian', 'ancient european',
'ancient middle-eastern', 'asian (general)',
'austrian', 'belgian', 'british', 'chinese', 'czech', 'dutch', 'egyptian', 'european (general)',
'french',
'german', 'greek',
'iranian', 'italian', 'japanese', 'latin american', 'middle eastern', 'roman', 'russian', 'south asian',
'southeast asian',
'spanish', 'swiss', 'various'}
def assert_gpu():
"""
This function will raise an exception if a GPU is not available to tensorflow.
"""
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
def init():
global culture_model
global classification_model
global metadata
global keras_model
os.environ["CUDA_VISIBLE_DEVICES"] = str(0)
assert_gpu()
print("Initializing Spark")
# downloading java dependencies
print(os.environ.get("JAVA_HOME", "WARN: No Java home found"))
SparkSession.builder \
.master("local[*]") \
.appName("TestConditionalBallTree") \
.config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1-38-a6970b95-SNAPSHOT") \
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
.config("spark.driver.memory", "32g") \
.config("spark.executor.heartbeatInterval", "60s") \
.getOrCreate()
print("Spark Initialized")
from mmlspark.nn.ConditionalBallTree import ConditionalBallTree
print("Downloading Models")
if not os.path.exists("medium.ball"):
print("Downloading medium")
os.system('wget https://mmlsparkdemo.blob.core.windows.net/mosaic/medium.ball')
print("downloaded medium")
if not os.path.exists('culture.ball'):
print("Downloading culture")
os.system('wget https://mmlsparkdemo.blob.core.windows.net/mosaic/culture.ball')
print("downloaded culture")
# initialize the model architecture and load in imagenet weights
culture_model = ConditionalBallTree.load('culture.ball')
classification_model = ConditionalBallTree.load('medium.ball')
# Model for featurizing
keras_model = ResNet50(
input_shape=[225, 225, 3],
weights='imagenet',
include_top=False,
pooling='avg'
)
def get_similar_images(img, culture=None, classification=None, n=5):
"""Return an n-size array of image objects similar to the pillow image provided
using the culture or classification as a filter. If no filter is given, it filters on
all known classifications.
Arguments:
img {Image} -- Pillow image to compare to
culture {str} -- string of the culture to filter
classification {str} -- string of the classification to filter
n {int} -- number of results to return
Returns:
dict[] -- array of dictionaries representing artworks that are similar
"""
# Non RGB images won't have the right number of channels
if img.mode != 'RGB':
img = img.convert('RGB')
img = np.array(img) # PIL -> numpy
img = np.expand_dims(img, axis=0)
img = preprocess_input(img.astype(np.float32))
features = keras_model.predict(img) # featurize
features /= np.linalg.norm(features)
img_feature = features[0]
img_feature = img_feature.tolist()
# Get results based upon the filter provided
if culture is not None:
result = culture_model.findMaximumInnerProducts(
img_feature,
{culture},
n
)
selected_model = culture_model
elif classification is not None:
result = classification_model.findMaximumInnerProducts(
img_feature,
{classification},
n
)
selected_model = classification_model
else:
result = classification_model.findMaximumInnerProducts(
img_feature,
ALL_CLASSIFICATIONS,
n
)
selected_model = classification_model
results_with_data = []
for r in result:
row = selected_model._jconditional_balltree.values().apply(r[0])
dist = r[1]
results_with_data.append([json.loads(row), dist])
return results_with_data
def error_response(err_msg):
"""Returns an error response for a given error message
Arguments:
err_msg {str} -- error message
Returns:
AMLResponse -- response object for the error
"""
resp = AMLResponse(json.dumps({"error": err_msg}), 400)
resp.headers['Access-Control-Allow-Origin'] = "*"
resp.headers['Content-Type'] = "application/json"
return resp
def success_response(content):
"""Returns a success response with the given content
Arguments:
content {any} -- any json serializable data type to send to
the client
Returns:
AMLResponse -- response object for the success
"""
resp = AMLResponse(json.dumps({"results": content}), 200)
resp.headers['Access-Control-Allow-Origin'] = "*"
resp.headers['Content-Type'] = "application/json"
return resp
@rawhttp
def run(request):
print(request)
if request.method == 'POST':
try:
request_data = json.loads(request.data.decode('utf-8'))
response = requests.get(request_data['url']) # URL -> response
img = Image.open(BytesIO(response.content)).resize((225, 225)) # response -> PIL
query = request_data.get('query', None)
culture = query if query in ALL_CULTURES else None
classification = query if query in ALL_CLASSIFICATIONS else None
similar_images = get_similar_images(
img,
culture=culture,
classification=classification,
n=int(request_data['n'])
)
return success_response(similar_images)
except Exception as err:
traceback.print_exc()
return error_response(str(err))
else: # unsupported http method
return error_response("invalid http request method")

View file

@ -5,11 +5,11 @@ import base64
import os
import subprocess
import sys
from io import BytesIO
def install(package):
subprocess.check_call([sys.executable, "-m", "pip", "install", package])
install("azure-storage-blob")
install("flask-cors")
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, ContentSettings
@ -24,48 +24,53 @@ container_name = "mosaic-shares"
BAD_REQUEST_STATUS_CODE = 400
NOT_FOUND_STATUS_CODE = 404
def allowed_file(filename):
return '.' in filename and \
filename.rsplit('.', 1)[1].lower() in {'png', 'jpg', 'jpeg'}
app = Flask(__name__)
CORS(app)
@app.route('/', methods=['GET'])
def home():
return jsonify({ "status": "ok", "version": "1.0.0" })
return jsonify({"status": "ok", "version": "1.0.0"})
@app.route('/upload', methods=['POST'])
def upload():
# frontend uploads image, we save to azure storage blob and return a link to the image and the share page
if request.method == 'POST':
if request.args.get("filename") is None:
return jsonify({ "error": "filename parameter must be specified" })
return jsonify({"error": "filename parameter must be specified"})
filename = request.args.get("filename")
content_type = None
try:
img_b64 = request.form.get('image').split(',')
image = base64.b64decode(img_b64[1])
content_type = img_b64[0].split(':')[1].split(';')[0]  # gets content type from data:image/png;base64
except Exception:
return jsonify({"error": "unable to decode"})
if allowed_file(filename):
filename = secure_filename(filename)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=filename)
try:
blob_client.upload_blob(image)
blob_client.set_http_headers(content_settings=ContentSettings(content_type=content_type))
print(content_type)
except Exception as err:
print(err)
finally:
img_url = "https://mmlsparkdemo.blob.core.windows.net/mosaic-shares/mosaic-shares/" + filename
return jsonify({ "img_url": img_url })
return jsonify({"img_url": img_url})
return jsonify({"error": "error processing file"})
else:
return jsonify({"error": "upload is a post request"})
@app.route('/share', methods=['GET'])
def share():
image_url = request.args.get('image_url')
@ -85,5 +90,6 @@ def share():
height=height
)
if __name__ == "__main__":
app.run(debug=True, host='0.0.0.0')

View file

data science/.gitignore vendored
View file

@ -1,3 +0,0 @@
dataset
dataset_small
metadata

View file

@ -1,264 +0,0 @@
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from torch.autograd import Variable
from PIL import Image
import numpy as np
import pickle
import time
import json
import tqdm
import os
import torch.nn as nn
from torch.utils import data
import torchvision.datasets as datasets
from torchvision.models import SqueezeNet, ResNet
def load_dataset(batch_size):
""" Loads the dataset from the dataset/fonts folder with the specified batch size. """
data_path = 'dataset/art_dirs/'
train_dataset = datasets.ImageFolder(
root=data_path,
transform=transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor(), transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
)
# think about other possible transforms to scale to 224x224? random crop, etc.
train_loader = data.DataLoader(
train_dataset,
batch_size=batch_size,
num_workers=6,
shuffle=False
)
return train_loader
def cosine_similarity(a, b):
""" Computes the cosine similarity between two vectors using PyTorch's built-in function. """
cos = nn.CosineSimilarity(dim=1, eps=1e-6)
cos_sim = cos(a.unsqueeze(0), b.unsqueeze(0))
# cos_sim = cos(a, b)
# print('\nCosine similarity: {0}\n'.format(cos_sim))
return cos_sim
def get_feature_vector(imgs, model, layer):
""" Extracts the feature vectors from a particular model at a particular layer, given a Tensor of images. """
def get_vector(imgs):
def vector_size(layer):
for param in layer.parameters():
return param.shape[1]
def copy_data(m, i, o):
my_embedding.copy_(o.data.squeeze())
t_imgs = Variable(imgs)
if isinstance(model, SqueezeNet):
my_embedding = torch.zeros(imgs.shape[0], 1000)
elif isinstance(model, ResNet):
my_embedding = torch.zeros(imgs.shape[0], vector_size(model._modules.get('fc'))) # extracts size of vector from the shape of the FC layer
h = layer.register_forward_hook(copy_data)
model(t_imgs)
h.remove()
return my_embedding
img_vector = get_vector(imgs).numpy()
return img_vector
# def get_sample_filenames():
# # Small dataset to do testing things
# img1 = "datasets/animals/birb.jpg"
# img2 = "datasets/animals/birb2.jpg"
# img3 = "datasets/animals/bork.jpg"
# img4 = "datasets/animals/snek.jpg"
# imgs = [img1, img2, img3, img4]
# return imgs
def get_filenames():
""" Extracts filenames from the dataset/fonts folder. """
# TODO replace with better data reading?
total_filenames = []
root_dir = "dataset/art_dirs/"
for root, dirs, files in os.walk(root_dir):
for file in files:
total_filenames.append(os.path.join(root, file).replace('\\', '/'))
return total_filenames
def get_metadata(filenames):
""" Extracts metadata from filenames and writes them to a .pkl file. """
total_metadata = []
for filename in filenames:
with open(filename) as f:
metadata = json.load(f)
total_metadata.append(metadata)
with open("metadata.json", 'w') as outfile:
json.dump(total_metadata, outfile)
def process_images(model, dataset):
""" Extracts all the feature vectors of a particular dataset using a particular model and writes them to a .pkl file. """
# start = time.time()
total_features = []
model.cuda()
if isinstance(model, ResNet):
layer = model._modules.get('avgpool')
elif isinstance(model, SqueezeNet):
layer = list(model.children())[-1][-1]
model.eval()
print(model.name)
count = 0
for imgs, label in dataset:
try:
feature_vector = get_feature_vector(imgs.cuda(), model, layer)
total_features.extend(feature_vector)
if count % 120 == 0:
print(count)
count += 1
except Exception:
print("error loading: {}".format(label))
total_numpy_features = np.array(total_features)
pickle.dump(total_numpy_features, open("dataset/art_features-" + model.name + ".pkl", "wb"))
# end = time.time()
# print(end - start)
# print(model.name)
def run_models():
""" Extracts the feature vectors from the dataset from a variety of different models. """
dataset = load_dataset(32)
# resnet18 = models.resnet18(pretrained=True)
# resnet34 = models.resnet34(pretrained=True)
# resnet50 = models.resnet50(pretrained=True)
# resnet101 = models.resnet101(pretrained=True)
# resnet152 = models.resnet152(pretrained=True)
# inception = models.inception_v3(pretrained=True)
# googlenet = models.googlenet(pretrained=True)
# shufflenet = models.shufflenet_v2_x1_0(pretrained=True)
# resnext50_32x4d = models.resnext50_32x4d(pretrained=True)
# resnext101_32x8d = models.resnext101_32x8d(pretrained=True)
# wide_resnet101_2 = models.wide_resnet101_2(pretrained=True)
# wide_resnet50_2 = models.wide_resnet50_2(pretrained=True)
# alexnet = models.alexnet(pretrained=True)
squeezenet = models.squeezenet1_1(pretrained=True)
# vgg16 = models.vgg16(pretrained=True)
# densenet = models.densenet161(pretrained=True)
# mobilenet = models.mobilenet_v2(pretrained=True)
# mnasnet = models.mnasnet1_0(pretrained=True)
# resnet18.name = "resnet18"
# resnet34.name = "resnet34"
# resnet50.name = "resnet50"
# resnet101.name = "resnet101"
# resnet152.name = "resnet152"
# inception.name = "inception"
# googlenet.name = "googlenet"
# shufflenet.name = "shufflenet"
# resnext50_32x4d.name = "resnext50_32x4d"
# resnext101_32x8d.name = "resnext101_32x8d"
# wide_resnet101_2.name = "wide_resnet101_2"
# wide_resnet50_2.name = "wide_resnet50_2"
# alexnet.name = "alexnet"
squeezenet.name = "squeezenet"
# vgg16.name = "vgg16"
# densenet.name = "densenet"
# mobilenet.name = "mobilenet"
# mnasnet.name = "mnasnet"
print("Models created")
# # only needs to be run once, commented out for now
# filenames = get_filenames()
# get_metadata(filenames)
# print("Metadata done")
# all_models = [resnet18, alexnet, squeezenet, vgg16, densenet, inception, googlenet,
# shufflenet, mobilenet, resnext50_32x4d, wide_resnet50_2, mnasnet]
all_models = [squeezenet]
for model in all_models:
process_images(model, dataset)
print(model.name + " done")
def benchmark_different():
models = ["datasets/features-resnet18.pkl", "datasets/features-resnet34.pkl", "datasets/features-resnet50.pkl", "datasets/features-resnet101.pkl", "datasets/features-resnext50_32x4d.pkl", "datasets/features-wide_resnet50_2.pkl"]
# models = ["datasets/features-resnext50_32x4d.pkl"]
for model in models:
vectors = pickle.load(open(model, "rb"))
total = 0
for i in range(248):
    for j in range(248):
        total += cosine_similarity(torch.from_numpy(vectors[i]), torch.from_numpy(vectors[j]))
print(model)
print(total / (248 * 248))  # mean similarity over all 248 x 248 pairs
if __name__ == '__main__':
torch.multiprocessing.freeze_support()
run_models()
# print(torch.cuda.is_available())
# benchmark_different()
# print(pickle.load(open("datasets/metadata.pkl", "rb")))
# vectors = pickle.load(open("datasets/features-resnet18.pkl", "rb"))
# print(cosine_similarity(torch.from_numpy(vectors[10]), torch.from_numpy(vectors[11])))
# datasets/features-resnet18.pkl
# tensor([0.8278])
# datasets/features-resnet34.pkl
# tensor([0.8308])
# datasets/features-resnet50.pkl
# tensor([0.8584])
# datasets/features-resnet101.pkl
# tensor([0.8729])
# datasets/features-resnext50_32x4d.pkl
# tensor([0.8269])
# datasets/features-wide_resnet50_2.pkl
# tensor([0.8112])
# datasets/features-resnet18.pkl
# tensor([0.8796])
# datasets/features-resnet34.pkl
# tensor([0.8815])
# datasets/features-resnet50.pkl
# tensor([0.9034])
# datasets/features-resnet101.pkl
# tensor([0.9034])
# datasets/features-resnext50_32x4d.pkl
# tensor([0.8826])
# datasets/features-wide_resnet50_2.pkl
# tensor([0.8732])

View file

@ -1,24 +0,0 @@
"""Given json files in a folder, output a json list of all files in alphabetical order by filename"""
import os
from tqdm import tqdm
import json
FOLDER_PATH = "./dataset/metadata"
total_filenames = []
total_metadata = []
for root, dirs, files in os.walk(FOLDER_PATH):
for file in tqdm(files):
total_filenames.append(os.path.join(root, file).replace('\\', '/'))
for filename in tqdm(total_filenames):
with open(filename, encoding="utf8") as jsonfile:
try:
metadata = json.load(jsonfile)
total_metadata.append(metadata)
except ValueError:  # covers json.JSONDecodeError
print("Error parsing: {}".format(filename))
with open("metadata.json", "w") as outfile:
json.dump(total_metadata, outfile)

View file

@ -1,78 +0,0 @@
import torch
import torch.nn as nn
from torch.utils import data
import torchvision.datasets as datasets
import torchvision.models as models
import torchvision.transforms as transforms
from torch.autograd import Variable
from PIL import Image
def load_dataset(batch_size):
data_path = './dataset'
train_dataset = datasets.ImageFolder(
root=data_path,
transform=transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor(), transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
)
# think about other possible transforms to scale to 224x224? random crop, etc.
train_loader = data.DataLoader(
train_dataset,
batch_size=batch_size,
num_workers=6,
shuffle=False
)
return train_loader
# Load the pretrained model
model = models.resnet18(pretrained=True)
# Use CUDA to utilize the GPU
print(torch.cuda.is_available())
model.cuda()
# Use the model object to select the desired layer
layer = model._modules.get('avgpool')
# Set model to evaluation mode
model.eval()
def image_vectors(imgs):
def vector_size(layer):
for param in layer.parameters():
return param.shape[1]
def copy_data(m, i, o):
my_embedding.copy_(o.data.squeeze())
t_imgs = Variable(imgs)
my_embedding = torch.zeros(imgs.shape[0], vector_size(model._modules.get('fc')))
h = layer.register_forward_hook(copy_data)
model(t_imgs)
h.remove()
return my_embedding
def cosine_similarity(a, b):
# Using PyTorch Cosine Similarity
cos = nn.CosineSimilarity(dim=1, eps=1e-6)
cos_sim = cos(a.unsqueeze(0), b.unsqueeze(0))
# print('\nCosine similarity: {0}\n'.format(cos_sim))
return cos_sim
if __name__ == '__main__':
torch.multiprocessing.freeze_support()
batch_size = 64
dataset = load_dataset(batch_size)
for batch, labels in dataset:
a = image_vectors(batch.cuda()) # features, matrix of dimensions (batch size) x (vector size)
b = labels
# test on dogs and cats
for pair in [(0, 1), (2, 3), (0, 2), (1, 3)]:
print(cosine_similarity(a[pair[0]], a[pair[1]]))

View file

@ -1,15 +0,0 @@
import os
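# the image type is encoded as the third underscore-separated field of the filename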
def extract_image_type(filename):
split_name = int(filename.split("_")[2])
return split_name
base = "C://Users/v-stfu/Documents/GitHub/art/data science/dataset/"
for root, dirs, files in os.walk(base + "art"):
for file in files:
image_type = extract_image_type(file)
if not os.path.exists(base + "art_dirs/" + str(image_type)):
os.mkdir(base + "art_dirs/" + str(image_type))
os.rename("C://Users/v-stfu/Documents/GitHub/art/data science/dataset/art/" + file,
"C://Users/v-stfu/Documents/GitHub/art/data science/dataset/art_dirs/" + str(image_type) + "/" + file)

Binary data
data science/test/cats/cat.jpg

Binary file not shown.

Before

Width:  |  Height:  |  Size: 48 KiB

Binary data
data science/test/cats/kitten.jpg

Binary file not shown.

Before

Width:  |  Height:  |  Size: 69 KiB

Binary data
data science/test/dogs/dog.jpg

Binary file not shown.

Before

Width:  |  Height:  |  Size: 70 KiB

Binary data
data science/test/dogs/doggy.jpg

Binary file not shown.

Before

Width:  |  Height:  |  Size: 2.8 MiB

View file

@ -0,0 +1,23 @@
import os
import pandas as pd
import requests
from tqdm import tqdm
batch_size = 256
img_width = 225
img_height = 225
model = "resnet"
metadata_fn = "metadata.json"
data_dir = "images"
os.makedirs(data_dir, exist_ok=True)
metadata = pd.read_json(metadata_fn, lines=True)
for i, row in tqdm(metadata.iterrows()):
target_file = os.path.join(data_dir, row["id"] + ".jpg")
if not os.path.exists(target_file):
try:
    # fetch first, then write, so a failed request doesn't leave an empty file behind
    content = requests.get(row["Thumbnail_Url"]).content
    with open(target_file, 'wb') as f:
        f.write(content)
except Exception as e:
    print(e)

View file

@ -0,0 +1,128 @@
import os
import pandas as pd
import torch
import torch.nn as nn
from PIL import Image
from torch.utils.data import Dataset
from torchvision import models, transforms
from tqdm import tqdm
import numpy as np
import pickle
data_dir = 'images'
metadata_fn = "metadata.json"
features_dir = "features"
features_file = os.path.join(features_dir, "pytorch_rn50.pkl")
featurize_images = True
device = torch.device("cuda:0")
os.makedirs(data_dir, exist_ok=True)
os.makedirs(features_dir, exist_ok=True)
if featurize_images:
class ArtDataset(Dataset):
"""Face Landmarks dataset."""
def __init__(self, metadata_json, image_dir, transform):
"""
Args:
csv_file (string): Path to the csv file with annotations.
root_dir (string): Directory with all the images.
transform (callable, optional): Optional transform to be applied
on a sample.
"""
self.metadata = pd.read_json(metadata_json, lines=True)
self.image_dir = image_dir
self.transform = transform
def __len__(self):
return len(self.metadata)
def __getitem__(self, idx):
if torch.is_tensor(idx):
idx = idx.tolist()
metadata = self.metadata.iloc[idx]
with open(os.path.join(self.image_dir, metadata["id"] + ".jpg"), "rb") as f:
image = Image.open(f).convert("RGB")
return self.transform(image), metadata["id"]
data_transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
dataset = ArtDataset(metadata_fn, data_dir, data_transform)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=False, num_workers=4)
dataset_size = len(dataset)
# Load a pretrained ResNet-50 to use as the featurizer
model = models.resnet50(pretrained=True)
model.eval()
model.to(device)
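# drop the final fully connected layer; the remaining network outputs the 2048-d pooled features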
cut_model = nn.Sequential(*list(model.children())[:-1])
all_outputs = []
all_ids = []
for i, (inputs, ids) in enumerate(tqdm(data_loader)):
inputs = inputs.to(device)
outputs = torch.squeeze(cut_model(inputs)).detach().cpu().numpy()
all_outputs.append(outputs)
all_ids.append(list(ids))
all_outputs = np.concatenate(all_outputs, axis=0)
all_ids = np.concatenate(all_ids, axis=0)
with open(features_file, "wb+") as f:
pickle.dump((all_outputs, all_ids), f)
with open(features_file, "rb") as f:
with torch.no_grad():
(all_outputs, all_ids) = pickle.load(f)
all_urls = np.array(pd.read_json(metadata_fn, lines=True).loc[:, "Thumbnail_Url"])
features = torch.from_numpy(all_outputs).float().to("cpu:0")
features = features / torch.sqrt(torch.sum(features ** 2, dim=1, keepdim=True))
features = features.to(device)
indices = torch.arange(0, features.shape[0]).to(device)
print("loaded features")
metadata = pd.read_json(metadata_fn, lines=True)
culture_arr = np.array(metadata["Culture"])
cultures = metadata.groupby("Culture").count()["id"].sort_values(ascending=False).index.to_list()
media_arr = np.array(metadata["Classification"])
media = metadata.groupby("Classification").count()["id"].sort_values(ascending=False).index.to_list()
ids = np.array(metadata["id"])
masks = {"culture": {}, "medium": {}}
for culture in cultures:
masks["culture"][culture] = torch.from_numpy(culture_arr == culture).to(device)
for medium in media:
masks["medium"][medium] = torch.from_numpy(media_arr == medium).to(device)
all_matches = []
for i, row in tqdm(metadata.iterrows()):
feature = features[i]
matches = {"culture": {}, "medium": {}}
all_dists = torch.sum(features * feature, dim=1).to(device)
for culture in cultures:
    selected_indices = indices[masks["culture"][culture]]
    k = min(10, selected_indices.shape[0])
    dists, inds = torch.topk(all_dists[selected_indices], k, sorted=True)
    matches["culture"][culture] = ids[selected_indices[inds].cpu().numpy()]
for medium in media:
    selected_indices = indices[masks["medium"][medium]]
    k = min(10, selected_indices.shape[0])
    dists, inds = torch.topk(all_dists[selected_indices], k, sorted=True)
    matches["medium"][medium] = ids[selected_indices[inds].cpu().numpy()]
all_matches.append(matches)
metadata["matches"] = all_matches
metadata.to_json("results/metadata_enriched.json")
print("here")