This commit is contained in:
Mark Hamilton 2020-07-22 23:04:21 -04:00
Parent 0647918785
Commit dd1d8d3170
25 changed files: 201 additions and 1440 deletions

View file

@ -1,19 +0,0 @@
FROM tensorflow/tensorflow:1.13.2-gpu
ARG SPARK_VERSION=2.4.3
RUN apt-get -qq update && apt-get -qq -y install curl bzip2 gcc \
&& curl -sSL https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh \
&& bash /tmp/miniconda.sh -bfp /usr/local \
&& rm -rf /tmp/miniconda.sh \
&& conda install -y python=3 \
&& conda update conda \
&& apt-get install openjdk-8-jre-headless -y \
&& conda install pyspark=${SPARK_VERSION} \
&& apt-get -qq -y remove curl bzip2 \
&& apt-get -qq -y autoremove \
&& apt-get autoclean \
&& rm -rf /var/lib/apt/lists/* /var/log/dpkg.log \
&& conda clean --all --yes
# Miniconda is installed with prefix /usr/local above, so its bin directory (not /opt/conda) goes on PATH
ENV PATH /usr/local/bin:$PATH
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64

View file

@ -1,59 +1,48 @@
## For Evaluation
Evaluation notebook on Databricks takes two pickle files as input:
# Mosaic
**metadata.pkl**: list of (content, style) tuples
## About
**features-{model}.pkl**: 2D numpy array (data points x length of feature vector) of feature vectors
## Architecture
`features.py` currently writes out both `.pkl` files for the ResNet models.
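A minimal sketch of consuming these two files (`resnet50` as the model name and the cosine-similarity check are illustrative):
```python
import pickle
import numpy as np

with open("metadata.pkl", "rb") as f:
    metadata = pickle.load(f)  # list of (content, style) tuples

with open("features-resnet50.pkl", "rb") as f:
    features = pickle.load(f)  # 2D array: data points x feature-vector length

assert features.shape[0] == len(metadata)
a, b = features[0], features[1]
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity
```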
## Paper
Work from day one of the hackathon; deploys the model onto AzureML so it can be run as a web service.
## Building from Scratch
## Contents
### Backend
File/Folder: Description
+ 'azureml': All code is here
+ '.vs'
+ '.vscode'
+ 'call.py': Regression model, run to simulate a call to the model
+ 'call2.py': Resnet50 model, run to simulate a call to the model **(CURRENTLY NOT WORKING)**
+ 'deploy.py': Regression model, deploys the model to AzureML
+ 'deploy2.py': Resnet50 model, deploys the model to AzureML **(CURRENTLY NOT WORKING)**
+ 'deploymentConfig.yml': configuration details for deploying the model to AzureML
+ 'my_model.h5': Model of Resnet50, may need modification
+ 'myenv.yml': details about the python environment
+ 'panda.jpg': image of a panda used for testing
+ 'score.py': Regression model, handles the actual calculations when called from call.py
+ 'score2.py': Resnet model, handles the actual classification when called from call2.py
+ 'sklearn_regression_model.pkl': Regression model
+ 'gitignore': What to ignore at commit time
+ 'CODE_OF_CONDUCT': Code of conduct
+ 'LICENSE': The license for the files
+ 'README': This README file
+ 'SECURITY': Security information
1. Download Image Metadata:
```bash
# quote the URL so the shell does not treat '&' as the background operator
wget "https://mmlsparkdemo.blob.core.windows.net/cknn/metadata.json?sv=2019-02-02&st=2020-07-23T02%3A22%3A30Z&se=2023-07-24T02%3A22%3A00Z&sr=b&sp=r&sig=hDnGw9y%2BO5XlggL6br%2FPzSKmpAdUZ%2F1LJKVkcmbVmCE%3D" -O metadata.json
```
1. Download Images:
```bash
cd data_prep
python download_images.py
```
1. Featurize and perform Conditional Image Retrieval on every image
```bash
cd data_prep
python featurize_and_match.py
```
1. Write enriched information to an Azure Search Index.
More detailed code is coming soon; follow [the closely related guide](
https://docs.microsoft.com/en-us/azure/cognitive-services/big-data/recipes/art-explorer) for a similar example.
## Prerequisites
### Frontend
Visual Studio/Visual Studio Code (if you use Visual Studio Code to set up your Python environment, use Ctrl+Shift+P, "Python: Select Interpreter") \
Anaconda 3.7 \
Azure Machine Learning workspace \
Azure Machine Learning SDK (https://docs.microsoft.com/en-us/python/api/overview/azure/ml/install?view=azure-ml-py) \
Docker \
Tensorflow \
Keras
1. Install `npm` if you don't already have it. You can find instructions at [https://nodejs.org/](https://nodejs.org/).
1. Install dependencies:
```bash
cd frontend
npm install
```
1. Start the development server:
```bash
npm start
```
1. Navigate to [http://localhost:3000/art](http://localhost:3000/art) to explore the local website.
## Setup
Run either `deploy.py` or `deploy2.py` once, depending on which model you are using, to deploy that model to the target.
## Running the sample
Run `call.py` or `call2.py`, depending on which model you are using, to simulate a call to the AzureML model.
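For example (a minimal sketch; run from the folder that contains the scripts, and use the `*2` variants for the ResNet50 model):
```bash
python deploy.py  # one-time deployment of the model to AzureML
python call.py    # simulate a client call to the deployed web service
```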
## Key concepts
The general idea is that we are deploying a machine learning model to the cloud as a web service. The regression model was mostly a proof of concept; the ResNet50 model is closer to what we will end up using, since we will be handling images.
## Contributing

View file

@ -1,81 +0,0 @@
# Model Training and Inference
- [Model Training and Inference](#model-training-and-inference)
- [File Structure](#file-structure)
- [Getting Started](#getting-started)
- [Install the Python Dependencies](#install-the-python-dependencies)
- [Deploying Featurization Script](#deploying-featurization-script)
- [Training](#training)
- [Service Deployment](#service-deployment)
Mosaic allows users to find similar artworks by featurizing artwork images with a pretrained Keras model, normalizing the resulting vectors, and loading them into a ball tree that can be queried quickly for similar artwork, filtered by either culture or classification (medium).
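A minimal sketch of that pipeline, using the same ResNet50 featurizer that `featurize.py` and `score.py` use (the `featurize` helper is illustrative; the ball-tree step is sketched under Training below):
```python
import numpy as np
from PIL import Image
from keras.applications.resnet50 import ResNet50, preprocess_input

# pretrained featurizer: global-average-pooled ResNet50 embeddings
model = ResNet50(input_shape=[225, 225, 3], weights="imagenet",
                 include_top=False, pooling="avg")

def featurize(path):
    img = Image.open(path).convert("RGB").resize((225, 225))
    x = preprocess_input(np.expand_dims(np.array(img), axis=0).astype(np.float32))
    feature = model.predict(x)[0]
    return feature / np.linalg.norm(feature)  # unit norm: inner product == cosine similarity
```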
## File Structure
This folder contains various scripts and configuration files that either are deployed on Azure or automate the deployment process.
- `featurize.py` is deployed to Azure Machine Learning as an experiment to read image metadata from a mounted Azure Storage blob, download the images, featurize the images, and save them into a ball tree in the file system.
- `deploy_featurize.py` automates the deployment of `featurize.py` by mounting the storage blob, spinning up a GPU cluster, and running the experiment on the cluster. Once the experiment is complete, it registers the entire `outputs/` folder as a model named `mosaic_model`.
- `score.py` is deployed to Azure Machine Learning as a web service that allows clients to query the model. An initialization function `init()` loads the model (trained by `featurize.py`) from disk and optionally asserts that Tensorflow is able to detect a GPU. In `run(request)`, we receive an `AMLRequest` object where we can read the entire request object (HTTP method, query params, etc.) to determine the inputs for running model inference. We query the model and return the list of similar artwork filtered by the request parameters.
- `deploy_score_local.py` runs an instance of `score.py` for local debugging. It builds the Docker image and saves it to the local Docker image store. It then attempts to run the Docker container and, if successful, gives the user a URL for the inference server. Runtime ranges from 5 to 20 minutes.
- `deploy_score_aks.py` runs an instance of `score.py` in an AKS cluster. It attempts to attach to a cluster and service if already running, otherwise it creates a service on an existing or new cluster. It then deploys the model and script onto the cluster. Runtime ranges from 10-20 minutes.
- `./GPU_Docker/Dockerfile` is a Dockerfile that specifies how to build the base image for training and scoring. It includes `tensorflow-gpu` for GPU drivers, `Java` for `pyspark`, and an installation of `Anaconda`. This Dockerfile has been built and hosted on [DockerHub](https://hub.docker.com/repository/docker/typingkoala/mosaic_base_image) in the repo `typingkoala/mosaic_base_image`.
- `call_service.py` is a script that makes a POST request to our web service and prints the response.
## Getting Started
In order to deploy Mosaic, you will need the following installed on your computer.
- Python 3
- Docker
### Install the Python Dependencies
First, install the AzureML Python SDK. Make sure to activate your virtual environment if you are using one.
```bash
pip install --upgrade azureml-sdk
```
### Deploying Featurization Script
In order to begin online featurization, we first edit the `deploy_featurize.py` script with the appropriate workspace and Azure Storage information. On the first run, you will be prompted to log in to Microsoft using interactive authentication. Once completed, your authentication information will be cached locally for future runs.
```bash
python azureml/deploy_featurize.py
```
Running `deploy_featurize.py` will attach to a cluster (or create one if it doesn't exist) with the name specified in the script. It will then submit the `featurize.py` script as a job to complete. Logs will stream from the cluster to the local terminal. Once the script runs, the `outputs/` folder will be registered as a model so that it can be mounted to the inference cluster for serving web traffic.
## Training
The Ball Tree API originates from [MMLSpark](https://github.com/Azure/mmlspark). It allows for the initialization of a conditional ball tree with three methods: `findMaximumInnerProducts`, `save`, and `load`.
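A sketch of that API as it is used in this repo (the constructor arguments mirror the commented-out code in `featurize.py`; the toy data and exact signature are assumptions):
```python
from mmlspark.nn.ConditionalBallTree import ConditionalBallTree

features = [[1.0, 0.0], [0.8, 0.6]]   # unit-normalized feature vectors
ids = ["art_1", "art_2"]              # value returned for each match
labels = ["french", "dutch"]          # conditioning label per item
cbt = ConditionalBallTree(features, ids, labels, 50)  # 50 = leaf size

# top-1 inner-product match among items labeled 'french'
matches = cbt.findMaximumInnerProducts([1.0, 0.0], {"french"}, 1)

cbt.save("culture.ball")
cbt = ConditionalBallTree.load("culture.ball")
```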
The featurization of the images and the creation of the ball trees is done in `featurize.py`. The file reads a TSV of metadata from a mounted storage blob and downloads the images from the provided URLs. The images are featurized using ResNet50 embeddings, which are then used to create the ball-tree objects.
The training is run through `deploy_featurize.py` as an experiment on Azure Machine Learning (AML). It mounts the storage blob for `featurize.py`, submits the run, then saves the balltree objects and metadata in a model to be referenced later.
This can be run either through AML training clusters or locally to speed up the dev loop. Make sure the [workspace settings](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) are correct before running. The cluster settings, such as `vm_size` and the number of nodes, can be altered in `provisioning_config`. Setting `min_nodes=0` allows the cluster to scale down to zero nodes when not in use. To run locally, the container can be downloaded from Azure Container Registry, but only after you run it through AML. The repository URL can be found through Azure Container Registry and will resemble `extenamls.azurecr.io/azureml/azureml_0062a8f080ece0d27d:latest`.
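For example, to fetch the image (a sketch; the registry name `extenamls` is inferred from the URL above and assumes Azure CLI access to that registry):
```bash
az acr login --name extenamls
docker pull extenamls.azurecr.io/azureml/azureml_0062a8f080ece0d27d:latest
```
The container can then be run with a local bind mount: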
```bash
docker run -d -it --name <name> --mount type=bind,source=<source_directory>,target=/app <repository_url>
```
The `docker exec` command enables debugging through a bash terminal inside the container:
```bash
docker exec -it <name> bash
```
## Service Deployment
`score.py` is a web service that allows clients to query our model. It handles POST requests whose JSON body contains `url`, `n`, and an optional `query` naming a culture or classification. It loads the ball trees and metadata pickle created in `featurize.py`, then downloads the image at the provided URL and featurizes it. The featurized image is queried against either the culture or classification ball tree with the desired number of results, returning the closest matches. The metadata for the results is then sent back as a serialized JSON object.
The web service is deployed through `deploy_score_aks.py` to an inference cluster on Azure Kubernetes Service. It first tries to update an existing service; if that fails, it creates either a new service or a new cluster and service.
The service can be deployed to a cluster or locally. Make sure the [workspace settings](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) are correct before running. The settings for the inference cluster can be changed in `gpu_aks_config`. To deploy locally, run `deploy_score_local.py`.

View file

@ -1,15 +0,0 @@
import requests

resp = requests.post(
    "https://extern2020apim.azure-api.net/cknn/",
    json={
        "url": "https://mmlsparkdemo.blob.core.windows.net/rijks/resized_images/AK-BR-324.jpg",
        "n": 5,
        "query": "prints"})
print(resp.text)
response_data = resp.json()
if "results" not in response_data:
    raise Exception("FAILED: No results field.")

View file

@ -1,62 +0,0 @@
import os
from azureml.core import Datastore
from azureml.core import Experiment, Workspace
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.compute.amlcompute import AmlComputeProvisioningConfiguration
from azureml.train.estimator import Estimator
ws = Workspace(
subscription_id="f9b96b36-1f5e-4021-8959-51527e26e6d3",
resource_group="marhamil-mosaic",
workspace_name="mosaic-aml"
)
datastore = Datastore.register_azure_blob_container(
workspace=ws,
datastore_name='mosaic_datastore',
container_name='mosaic',
account_name='mmlsparkdemo',
sas_token="?sv=2019-02-02&ss=bf&srt=sco&sp=rlc&se=2030-01-23T04:14:29Z"
"&st=2020-01-22T20:14:29Z&spr=https,http&sig=nPlKziG9ppu4Vt5"
"b6G%2BW1JkxHYZ1dlm39mO2fMZlET4%3D",
create_if_not_exists=True)
cluster_name = "training-4"
try:
# Connecting to pre-existing cluster
compute_target = ComputeTarget(ws, cluster_name)
print("Found existing cluster...")
except ComputeTargetException:  # no existing cluster with this name
# Create a new cluster to train on
provisioning_config = AmlComputeProvisioningConfiguration(
vm_size="Standard_D4_v2",
min_nodes=0,
max_nodes=1
)
compute_target = ComputeTarget.create(ws, cluster_name, provisioning_config)
compute_target.wait_for_completion(show_output=True)
# Create and run the experiment
exp = Experiment(workspace=ws, name='featurize_artwork_4')
estimator = Estimator(
source_directory=".",
entry_script="featurize.py",
script_params={
"--data-dir": datastore.as_mount()
},
conda_dependencies_file=os.path.join(os.path.dirname(os.path.realpath(__file__)), "myenv.yml"),
use_docker=True,
custom_docker_image="typingkoala/mosaic_base_image:1.0.0",
compute_target=compute_target
)
run = exp.submit(estimator)
run.wait_for_completion(show_output=True)
# Save the ball trees and metadata produced by featurize.py
run.register_model(
model_name="mosaic_model_4",
model_path="outputs/"
)

View file

@ -1,28 +0,0 @@
from azureml.core import Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import LocalWebservice
# Set your workspace
ws = Workspace(
subscription_id="ce1dee05-8cf6-4ad6-990a-9c80868800ba",
resource_group="extern2020",
workspace_name="exten-amls"
)
# Settings for deployment
inference_config = InferenceConfig(
entry_script="rationale.py",
runtime="python",
source_directory=".",
conda_file="myenv.yml",
base_image="typingkoala/mosaic_base_image:1.0.0")
# Load existing model
model = Model(ws, name="mosaic_model")
# Deploy model locally
deployment_config = LocalWebservice.deploy_configuration(port=8890)
service = Model.deploy(ws, "rationalizing", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.state)

View file

@ -1,76 +0,0 @@
from azureml.core import Workspace
from azureml.core.compute import AksCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AksWebservice
from azureml.exceptions import WebserviceException
ws = Workspace(
subscription_id="f9b96b36-1f5e-4021-8959-51527e26e6d3",
resource_group="marhamil-mosaic",
workspace_name="mosaic-aml"
)
inference_config = InferenceConfig(
entry_script="score.py",
runtime="python",
source_directory=".",
conda_file="myenv.yml",
base_image="typingkoala/mosaic_base_image:1.0.0")
resource_group = 'extern2020'
cluster_name = 'aks-gpu'
service_name = 'artgpuservice'
"""
Creates a cluster if one by the name of cluster_name does not already exist.
Deploys a service to the cluster if one by the name of service_name does not already exist, otherwise it will update the existing service.
"""
try: # If cluster and service exists
aks_target = AksCompute(ws, cluster_name)
service = AksWebservice(name=service_name, workspace=ws)
# print(service.get_logs(num_lines=5000))
print("Updating existing service: {}".format(service_name))
service.update(inference_config=inference_config, auth_enabled=False)
service.wait_for_deployment(show_output=True)
except WebserviceException: # If cluster but no service
# Creating a new service
aks_target = AksCompute(ws, cluster_name)
print("Deploying new service: {}".format(service_name))
gpu_aks_config = AksWebservice.deploy_configuration(
autoscale_enabled=False,
num_replicas=1,
cpu_cores=2,
memory_gb=16,
auth_enabled=False)
service = Model.deploy(ws, service_name, [], inference_config, gpu_aks_config, aks_target, overwrite=True)
service.wait_for_deployment(show_output=True)
except ComputeTargetException: # If cluster doesn't exist
print("Creating new cluster: {}".format(cluster_name))
# Provision AKS cluster with GPU machine
prov_config = AksCompute.provisioning_configuration(
vm_size="Standard_NC6",
cluster_purpose=AksCompute.ClusterPurpose.DEV_TEST)
# Create the cluster
aks_target = ComputeTarget.create(
workspace=ws, name=cluster_name, provisioning_configuration=prov_config,
)
aks_target.wait_for_completion(show_output=True)
print("Deploying new service: {}".format(service_name))
gpu_aks_config = AksWebservice.deploy_configuration(
autoscale_enabled=False,
num_replicas=1,
cpu_cores=2,
memory_gb=16,
auth_enabled=False)
service = Model.deploy(ws, service_name, [], inference_config, gpu_aks_config, aks_target, overwrite=True)
service.wait_for_deployment(show_output=True)
print("State: " + service.state)
print("Scoring URI: " + service.scoring_uri)

View file

@ -1,28 +0,0 @@
from azureml.core import Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import LocalWebservice
# Set your workspace
ws = Workspace(
subscription_id="ce1dee05-8cf6-4ad6-990a-9c80868800ba",
resource_group="extern2020",
workspace_name="exten-amls"
)
# Settings for deployment
inference_config = InferenceConfig(
entry_script="score.py",
runtime="python",
source_directory=".",
conda_file="myenv.yml",
base_image="typingkoala/mosaic_base_image:1.0.0")
# Load existing model
model = Model(ws, name="mosaic_model")
# Deploy model locally
deployment_config = LocalWebservice.deploy_configuration(port=8890)
service = Model.deploy(ws, "scoring", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.state)

View file

@ -1,316 +0,0 @@
import argparse
import os
import pickle
import urllib.request
from azureml.core import Run, Workspace
from multiprocessing import Pool
import numpy as np
import pandas as pd
import tensorflow as tf
from PIL import Image
from keras.applications.resnet50 import ResNet50
from keras.applications.resnet50 import preprocess_input
from pyspark.sql import SparkSession
from pyspark import SparkContext, SQLContext
# Initialize
batch_size = 512
img_width = 225
img_height = 225
model = "resnet"
os.environ["CUDA_VISIBLE_DEVICES"] = str(0)
# gets mount location from aml_feat.py, passed in as args
parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", type=str, dest="data_folder")
data_folder = parser.parse_args().data_folder
tsv_path = os.path.join(data_folder, "met_rijks_metadata.tsv")
# create file paths for saving balltree later
output_root = './outputs'
features_culture_fn = os.path.join(output_root, 'features_culture.ball')
features_classification_fn = os.path.join(output_root, 'features_classification.ball')
metadata_fn = os.path.join(output_root, 'metadata.pkl')
#cached_features_url = "https://mmlsparkdemo.blob.core.windows.net/mosaic/features_and_successes.pkl"
#cached_features_fn = 'features_and_successes.pkl'
cached_features_url = "https://mmlsparkdemo.blob.core.windows.net/mosaic/met_and_rijks_art_with_features.parquet.zip"
cached_features_fn = "met_and_rijks_art_with_features.parquet.zip"
parquet_fn = "met_and_rijks_art_with_features.parquet"
write_to_index = False
# downloading java dependencies
print(os.environ.get("JAVA_HOME", "WARN: No Java home found"))
spark = SparkSession.builder \
.master("local[*]") \
.appName("TestConditionalBallTree") \
.config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1-38-a6970b95-SNAPSHOT") \
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
.config("spark.executor.heartbeatInterval", "60s") \
.config("spark.driver.memory", "32g") \
.config("spark.driver.maxResultSize", "8g") \
.getOrCreate()
from mmlspark.nn.ConditionalBallTree import ConditionalBallTree
# Featurize
def batch(iterable, n):
"""
Splits iterable into nested array with inner size n
"""
current_batch = []
for item in iterable:
if item is not None:
current_batch.append(item)
if len(current_batch) == n:
yield current_batch
current_batch = []
if current_batch:
yield current_batch
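# split the DataFrame into n_cores chunks and apply func to each chunk in its own process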
def parallel_apply(df, func, n_cores=16):
df_split = np.array_split(df, n_cores)
pool = Pool(n_cores)
df = pd.concat(pool.map(func, df_split))
pool.close()
pool.join()
return df
def retry(func, args, times):
    if times == 0:
        return False
    try:
        func(args)
        return True
    except Exception as e:
        print(e)
        return retry(func, args, times - 1)  # return the result so callers can tell whether any attempt succeeded
def download_image_inner(metadata_row):
"""
Download the image at the row's Thumbnail_Url and save it to disk as images/{museum}_{name},
where {name} is the last path segment of the URL. Raises on failure; retries are handled by retry().
"""
url = str(metadata_row["Thumbnail_Url"])
museum = str(metadata_row["Museum"])
local_file = "images/" + museum + "_" + url.split("/")[-1]
if not os.path.exists(local_file):
urllib.request.urlretrieve(url, local_file)
def download_image(metadata_row):
return retry(download_image_inner, metadata_row, 3)
def download_image_df(df):
df["Success"] = df.apply(download_image, axis=1)
return df
def load_images(rows, successes):
"""Given an array of images, return a numpy array of PIL image data
after reading the image's filename from disk.
If successful, append the image dict (with metadata) to the
provided metadata list, respectively.
Arguments:
images {dict[]} -- array of dictionaries that represent images of artwork
metadata {dict[]} -- array of image objects to append to if reading is successful
Returns:
Image[] -- an array of Pillow images
"""
batch = []
for i, row in rows:
filename = "images/" + row["Museum"] + "_" + row["Thumbnail_Url"].split("/")[-1]
try:
batch.append(load_image(filename))
successes.append(row)
except Exception as e:
print(e)
print("Failed to load image: " + filename)
return np.array(batch)
def load_image(filename):
"""
Given a filename of a museum image, make sure that it is an RGB image, then resize and preprocess the image.
Arguments:
filename {str} -- filename of the image to preprocess
Returns:
img {Image}-- an Image object that has been turned into an RGB image, resized, and preprocessed
"""
img = Image.open(filename)
# non RGB images won't have the right number of channels
if img.mode != 'RGB':
img = img.convert('RGB')
# re-size, expand dims and run through the ResNet50 model
img = np.array(img.resize((img_width, img_height)))
img = preprocess_input(img.astype(np.float32))
return img
def assert_gpu():
"""
This function will raise an exception if a GPU is not available to tensorflow.
"""
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
if not os.path.exists(tsv_path):
urllib.request.urlretrieve("https://mmlsparkdemo.blob.core.windows.net/mosaic/met_rijks_metadata.tsv",
tsv_path)
metadata = pd.read_csv(tsv_path, delimiter="\t", keep_default_na=False)
metadata = metadata.fillna('')  # replace nan values with empty string; fillna returns a copy
if write_to_index:
ws = Workspace(
subscription_id="ce1dee05-8cf6-4ad6-990a-9c80868800ba",
resource_group="extern2020",
workspace_name="exten-amls"
)
keyvault = ws.get_default_keyvault()
run = Run.get_context()
subscription_key = keyvault.get_secret(name="subscriptionKey")
image_subscription_key = keyvault.get_secret(name="imageSubscriptionKey")
from mmlspark.cognitive import AnalyzeImage
from mmlspark.stages import SelectColumns
import base64
def url_encode_id(idval):
return base64.b64encode(bytes(idval, "UTF-8")).decode("utf-8")
describeImage = (AnalyzeImage()
.setSubscriptionKey(image_subscription_key)
.setLocation("eastus")
.setImageUrlCol("Thumbnail_Url")
.setOutputCol("RawImageDescription")
.setErrorCol("Errors")
.setVisualFeatures(["Categories", "Tags", "Description", "Faces", "ImageType", "Color", "Adult"])
.setConcurrency(5))
df = spark.createDataFrame(metadata)
df2 = describeImage.transform(df) \
.select("*", "RawImageDescription.*").drop("Errors", "RawImageDescription").cache()
df2.coalesce(3).writeToAzureSearch(
subscriptionKey=subscription_key,
actionCol="searchAction",
serviceName="extern-search",
indexName="merged-art-search-6",
keyCol="id",
batchSize="1000"
)
if cached_features_url is not None:
if not os.path.exists(cached_features_fn):
urllib.request.urlretrieve(cached_features_url, cached_features_fn)
if not os.path.exists(parquet_fn):
print("extracting")
from zipfile import ZipFile
with ZipFile(cached_features_fn, 'r') as zipObj:
zipObj.extractall()
print("done extracting")
#with open(cached_features_fn, "rb") as f:
# [features, successes] = pickle.load(f)
#print("Loaded cached features")
else:
# create directory for downloading images, then download images simultaneously
print("Downloading images...")
os.makedirs("images", exist_ok=True)
metadata = parallel_apply(metadata, download_image_df)
metadata = metadata[metadata["Success"].fillna(False)] # filters out unsuccessful rows
batches = list(batch(metadata.iterrows(), batch_size))
successes = [] # clear metadata, images are only here if they are loaded from disk
data_iterator = (load_images(batch, successes) for batch in batches)
# featurize the images then normalize them
keras_model = ResNet50(
input_shape=[img_width, img_height, 3],
weights='imagenet',
include_top=False,
pooling='avg'
)
assert_gpu() # raises exception if gpu is not available
features = keras_model.predict_generator(data_iterator, steps=len(batches), verbose=1)
features /= np.linalg.norm(features, axis=1).reshape(len(successes), 1)
with open(cached_features_fn, "wb+") as f:
pickle.dump([features, successes], f)
# print(features.shape)
#
# from py4j.java_collections import ListConverter
#
# # convert to list and then create the two balltrees for culture and classification(medium)
# ids = [row["id"] for row in successes]
# features = features.tolist()
# print("fitting culture ball tree")
#
# converter = ListConverter()
# gc = SparkContext._active_spark_context._jvm._gateway_client
# from pyspark.ml.linalg import Vectors, VectorUDT
#
# java_features = converter.convert(features,gc)
# java_cultures = converter.convert([row["Culture"] for row in successes], gc)
# java_classifications = converter.convert([row["Classification"] for row in successes], gc)
# java_values = converter.convert(ids, gc)
#
# cbt_culture = ConditionalBallTree(java_features, java_values, java_cultures, 50)
# print("fitting class ball tree")
#
# cbt_classification = ConditionalBallTree(java_features, java_values, java_classifications, 50)
# print("fit culture ball tree")
#
# # save the balltrees to output directory and pickle the museum and id metadata
# os.makedirs(output_root, exist_ok=True)
# cbt_culture.save(features_culture_fn)
# cbt_classification.save(features_classification_fn)
# pickle.dump(successes, open(metadata_fn, 'wb+'))
from mmlspark.nn import *
df = spark.read.parquet(parquet_fn)
cols_to_group = df.columns
cols_to_group.remove("Norm_Features")
from pyspark.sql.functions import struct
df2 = df.withColumn("Meta", struct(*cols_to_group))
cknn_classification = (ConditionalKNN()
.setOutputCol("Matches")
.setFeaturesCol("Norm_Features")
.setValuesCol("Meta")
.setLabelCol("Classification")
.fit(df2))
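# wrap the fitted model's underlying Java ball tree so it can be saved with ConditionalBallTree.save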
cbt_classification = ConditionalBallTree(None, None, None, None, cknn_classification._java_obj.getBallTree())
cknn_culture = (ConditionalKNN()
.setOutputCol("Matches")
.setFeaturesCol("Norm_Features")
.setValuesCol("Meta")
.setLabelCol("Culture")
.fit(df2))
cbt_culture = ConditionalBallTree(None, None, None, None, cknn_culture._java_obj.getBallTree())
os.makedirs(output_root, exist_ok=True)
cbt_culture.save(features_culture_fn)
cbt_classification.save(features_classification_fn)

View file

@ -1,11 +0,0 @@
name: project_environment
dependencies:
- python=3.6.2
- tensorflow-gpu
- numpy
- keras
- Pillow
- pyspark
- pip:
- azureml-defaults[services]
- tqdm

View file

@ -1,160 +0,0 @@
import base64
import json
import os
import random
import sys
import traceback
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

import matplotlib.pyplot as plt
import numpy as np
import requests
import shap
import tensorflow as tf
from PIL import Image
from scipy.ndimage.filters import gaussian_filter
from skimage.io import imread

from azureml.contrib.services.aml_request import rawhttp
from azureml.contrib.services.aml_response import AMLResponse

import keras
import keras.backend as K
from keras import Model  # shadows azureml.core.model.Model, which this script does not use
from keras.applications.resnet50 import ResNet50, preprocess_input
from keras.layers import Input, Lambda
from keras.preprocessing import image
from pyspark.sql import SparkSession
def prep_image(url, preprocess=False):
response = requests.get(url)
img = Image.open(BytesIO(response.content))
if img.mode != 'RGB':
img = img.convert('RGB')
x = np.array(img.resize((224, 224)))
x = np.expand_dims(x, axis=0)
if preprocess:
return preprocess_input(x), img.size
return x, img.size
def setup_model():
keras_model = ResNet50(input_shape=[224, 224, 3], weights='imagenet', include_top=False, pooling='avg')
im1 = Input([224, 224, 3])
f1 = keras_model(im1)
return keras_model, im1, f1
def inv_logit(y):
return tf.math.log(y/(1-y))
def precompute(original_url):
training_img, training_img_size = prep_image(original_url)
x_train = np.array(gaussian_filter(training_img, sigma=4))
query_img, query_img_size = prep_image(original_url, True)  # the original image, preprocessed, is the query
im2_const = tf.constant(query_img, dtype=tf.float32)
im2 = Lambda(lambda im1: im2_const)(im1)
f2 = keras_model(im2)
d = keras.layers.Dot(1, normalize=True)([f1, f2])
logit = Lambda(lambda d: inv_logit((d+1)/2))(d)
model = Model(inputs=[im1], outputs=[logit])
e = shap.DeepExplainer(model, x_train)
return e
def test_match(match_url, e):
test_image, size = prep_image(match_url)
x_test = np.array(test_image)
shap_values = e.shap_values(x_test, check_additivity=False)
shap_values_normed = np.array(shap_values)
shap_values_normed = np.linalg.norm(shap_values_normed, axis=4)
blurred = gaussian_filter(shap_values_normed[0], sigma=4)
bflat = blurred.flatten()
shap_values_mask_qi = np.where(np.array(blurred) > np.mean(bflat) + np.std(bflat), 1, 0).reshape(224, 224, 1)
shap_values_qi = np.multiply(shap_values_mask_qi, x_test[0])
new_size = (224, int(size[1]/size[0]*224)) if size[0] > size[1] else (int(size[0]/size[1]*224), 224)
original_size = Image.fromarray(shap_values_qi.astype(np.uint8), 'RGB').resize(new_size)
return original_size
# # run before any queries
# keras_model, im1, f1 = setup_model()
# # run only once for each original image
# original_url = "https://mmlsparkdemo.blob.core.windows.net/cknn/datasets/interpret/lex1.jpg" # replace with link to original image
# e = precompute(original_url)
# # run for each new matched image
# match_url = "https://mmlsparkdemo.blob.core.windows.net/cknn/datasets/interpret/lex2.jpg" # replace with link to matched image
# explained_pic = test_match(match_url, e)
def assert_gpu():
"""
This function will raise an exception if a GPU is not available to tensorflow.
"""
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
def init():
global keras_model, im1, f1
keras_model, im1, f1 = setup_model()
print('model initialized')
def error_response(err_msg):
"""Returns an error response for a given error message
Arguments:
err_msg {str} -- error message
Returns:
AMLResponse -- response object for the error
"""
resp = AMLResponse(json.dumps({"error": err_msg}), 400)
resp.headers['Access-Control-Allow-Origin'] = "*"
resp.headers['Content-Type'] = "application/json"
return resp
def success_response(content):
"""Returns a success response with the given content
Arguments:
content {any} -- any json serializable data type to send to
the client
Returns:
AMLResponse -- response object for the success
"""
resp = AMLResponse(json.dumps({"results": content}), 200)
resp.headers['Access-Control-Allow-Origin'] = "*"
resp.headers['Content-Type'] = "application/json"
return resp
@rawhttp
def run(request):
    global e
    print(request)
    if request.method == 'POST':
        try:
            request_data = json.loads(request.data.decode('utf-8'))
            if request_data.get('original'):  # a new original image triggers recomputing the explainer
                e = precompute(request_data['original'])
            explained_pic = test_match(request_data['match'], e)
            # PIL images are not JSON serializable, so return the explanation as a base64-encoded PNG
            buffer = BytesIO()
            explained_pic.save(buffer, format="PNG")
            return success_response(base64.b64encode(buffer.getvalue()).decode("utf-8"))
        except Exception as err:
            traceback.print_exc()
            return error_response(str(err))
    else:  # unsupported http method
        return error_response("invalid http request method")

View file

@ -1,205 +0,0 @@
import json
import os
import traceback
from io import BytesIO

import numpy as np
import requests
import tensorflow as tf
from PIL import Image
from azureml.contrib.services.aml_request import rawhttp
from azureml.contrib.services.aml_response import AMLResponse
from keras.applications.resnet50 import ResNet50, preprocess_input
from pyspark.sql import SparkSession
ALL_CLASSIFICATIONS = {'prints', 'drawings', 'ceramics', 'textiles', 'paintings', 'accessories', 'photographs', "glass",
"metalwork", "sculptures", "weapons", "stone", "precious", "paper", "woodwork", "leatherwork",
"musical instruments", "uncategorized"}
ALL_CULTURES = {'african (general)', 'american', 'ancient american', 'ancient asian', 'ancient european',
'ancient middle-eastern', 'asian (general)',
'austrian', 'belgian', 'british', 'chinese', 'czech', 'dutch', 'egyptian', 'european (general)',
'french',
'german', 'greek',
'iranian', 'italian', 'japanese', 'latin american', 'middle eastern', 'roman', 'russian', 'south asian',
'southeast asian',
'spanish', 'swiss', 'various'}
def assert_gpu():
"""
This function will raise an exception if a GPU is not available to tensorflow.
"""
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
def init():
global culture_model
global classification_model
global metadata
global keras_model
os.environ["CUDA_VISIBLE_DEVICES"] = str(0)
assert_gpu()
print("Initializing Spark")
# downloading java dependencies
print(os.environ.get("JAVA_HOME", "WARN: No Java home found"))
SparkSession.builder \
.master("local[*]") \
.appName("TestConditionalBallTree") \
.config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1-38-a6970b95-SNAPSHOT") \
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
.config("spark.driver.memory", "32g") \
.config("spark.executor.heartbeatInterval", "60s") \
.getOrCreate()
print("Spark Initialized")
from mmlspark.nn.ConditionalBallTree import ConditionalBallTree
print("Downloading Models")
if not os.path.exists("medium.ball"):
print("Downloading medium")
os.system('wget https://mmlsparkdemo.blob.core.windows.net/mosaic/medium.ball')
print("downloaded medium")
if not os.path.exists('culture.ball'):
print("Downloading culture")
os.system('wget https://mmlsparkdemo.blob.core.windows.net/mosaic/culture.ball')
print("downloaded culture")
# initialize the model architecture and load in imagenet weights
culture_model = ConditionalBallTree.load('culture.ball')
classification_model = ConditionalBallTree.load('medium.ball')
# Model for featurizing
keras_model = ResNet50(
input_shape=[225, 225, 3],
weights='imagenet',
include_top=False,
pooling='avg'
)
def get_similar_images(img, culture=None, classification=None, n=5):
"""Return an n-size array of image objects similar to the pillow image provided
using the culture or classification as a filter. If no filter is given, it filters on
all known classifications.
Arguments:
img {Image} -- Pillow image to compare to
culture {str} -- string of the culture to filter
classification {str} -- string of the classification to filter
n {int} -- number of results to return
Returns:
dict[] -- array of dictionaries representing artworks that are similar
"""
# Non RGB images won't have the right number of channels
if img.mode != 'RGB':
img = img.convert('RGB')
img = np.array(img) # PIL -> numpy
img = np.expand_dims(img, axis=0)
img = preprocess_input(img.astype(np.float32))
features = keras_model.predict(img) # featurize
features /= np.linalg.norm(features)
img_feature = features[0]
img_feature = img_feature.tolist()
# Get results based upon the filter provided
if culture is not None:
result = culture_model.findMaximumInnerProducts(
img_feature,
{culture},
n
)
selected_model = culture_model
elif classification is not None:
result = classification_model.findMaximumInnerProducts(
img_feature,
{classification},
n
)
selected_model = classification_model
else:
result = classification_model.findMaximumInnerProducts(
img_feature,
ALL_CLASSIFICATIONS,
n
)
selected_model = classification_model
results_with_data = []
for r in result:
row = selected_model._jconditional_balltree.values().apply(r[0])
dist = r[1]
results_with_data.append([json.loads(row), dist])
return results_with_data
def error_response(err_msg):
"""Returns an error response for a given error message
Arguments:
err_msg {str} -- error message
Returns:
AMLResponse -- response object for the error
"""
resp = AMLResponse(json.dumps({"error": err_msg}), 400)
resp.headers['Access-Control-Allow-Origin'] = "*"
resp.headers['Content-Type'] = "application/json"
return resp
def success_response(content):
"""Returns a success response with the given content
Arguments:
content {any} -- any json serializable data type to send to
the client
Returns:
AMLResponse -- response object for the success
"""
resp = AMLResponse(json.dumps({"results": content}), 200)
resp.headers['Access-Control-Allow-Origin'] = "*"
resp.headers['Content-Type'] = "application/json"
return resp
@rawhttp
def run(request):
print(request)
if request.method == 'POST':
try:
request_data = json.loads(request.data.decode('utf-8'))
response = requests.get(request_data['url']) # URL -> response
img = Image.open(BytesIO(response.content)).resize((225, 225)) # response -> PIL
query = request_data.get('query', None)
culture = query if query in ALL_CULTURES else None
classification = query if query in ALL_CLASSIFICATIONS else None
similar_images = get_similar_images(
img,
culture=culture,
classification=classification,
n=int(request_data['n'])
)
return success_response(similar_images)
except Exception as err:
traceback.print_exc()
return error_response(str(err))
else: # unsupported http method
return error_response("invalid http request method")

View file

@ -5,11 +5,11 @@ import base64
import os
import subprocess
import sys
from io import BytesIO
def install(package):
subprocess.check_call([sys.executable, "-m", "pip", "install", package])
install("azure-storage-blob")
install("flask-cors")
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, ContentSettings
@ -24,48 +24,53 @@ container_name = "mosaic-shares"
BAD_REQUEST_STATUS_CODE = 400
NOT_FOUND_STATUS_CODE = 404
def allowed_file(filename):
return '.' in filename and \
filename.rsplit('.', 1)[1].lower() in {'png', 'jpg', 'jpeg'}
app = Flask(__name__)
CORS(app)
@app.route('/', methods=['GET'])
def home():
return jsonify({ "status": "ok", "version": "1.0.0" })
return jsonify({"status": "ok", "version": "1.0.0"})
@app.route('/upload', methods=['POST'])
def upload():
# frontend uploads image, we save to azure storage blob and return a link to the image and the share page
if request.method == 'POST':
if request.args.get("filename") is None:
return jsonify({ "error": "filename parameter must be specified" })
return jsonify({"error": "filename parameter must be specified"})
filename = request.args.get("filename")
content_type = None
try:
img_b64 = request.form.get('image').split(',')
image = base64.b64decode(img_b64[1])
content_type = img_b64[0].split(':')[1].split(';')[0]  # gets content type from data:image/png;base64
except Exception:
return jsonify({"error": "unable to decode"})
if allowed_file(filename):
filename = secure_filename(filename)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=filename)
try:
blob_client.upload_blob(image)
blob_client.set_http_headers(content_settings=ContentSettings(content_type=content_type))
print(content_type)
except Exception as err:
print(err)
finally:
img_url = "https://mmlsparkdemo.blob.core.windows.net/mosaic-shares/mosaic-shares/" + filename
return jsonify({ "img_url": img_url })
return jsonify({"img_url": img_url})
return jsonify({"error": "error processing file"})
else:
return jsonify({"error": "upload is a post request"})
@app.route('/share', methods=['GET'])
def share():
image_url = request.args.get('image_url')
@ -85,5 +90,6 @@ def share():
height=height
)
if __name__ == "__main__":
app.run(debug=True, host='0.0.0.0')

View file

data science/.gitignore vendored
View file

@ -1,3 +0,0 @@
dataset
dataset_small
metadata

View file

@ -1,264 +0,0 @@
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from torch.autograd import Variable
from PIL import Image
import numpy as np
import pickle
import time
import json
import tqdm
import os
import torch.nn as nn
from torch.utils import data
import torchvision.datasets as datasets
from torchvision.models import SqueezeNet, ResNet
def load_dataset(batch_size):
""" Loads the dataset from the dataset/fonts folder with the specified batch size. """
data_path = 'dataset/art_dirs/'
train_dataset = datasets.ImageFolder(
root=data_path,
transform=transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor(), transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
)
# think about other possible transforms to scale to 224x224? random crop, etc.
train_loader = data.DataLoader(
train_dataset,
batch_size=batch_size,
num_workers=6,
shuffle=False
)
return train_loader
def cosine_similarity(a, b):
""" Computes the cosine similarity between two vectors using PyTorch's built-in function. """
cos = nn.CosineSimilarity(dim=1, eps=1e-6)
cos_sim = cos(a.unsqueeze(0), b.unsqueeze(0))
# cos_sim = cos(a, b)
# print('\nCosine similarity: {0}\n'.format(cos_sim))
return cos_sim
def get_feature_vector(imgs, model, layer):
""" Extracts the feature vectors from a particular model at a particular layer, given a Tensor of images. """
def get_vector(imgs):
def vector_size(layer):
for param in layer.parameters():
return param.shape[1]
def copy_data(m, i, o):
my_embedding.copy_(o.data.squeeze())
t_imgs = Variable(imgs)
if isinstance(model, SqueezeNet):
my_embedding = torch.zeros(imgs.shape[0], 1000)
elif isinstance(model, ResNet):
my_embedding = torch.zeros(imgs.shape[0], vector_size(model._modules.get('fc'))) # extracts size of vector from the shape of the FC layer
h = layer.register_forward_hook(copy_data)
model(t_imgs)
h.remove()
return my_embedding
img_vector = get_vector(imgs).numpy()
return img_vector
# def get_sample_filenames():
# # Small dataset to do testing things
# img1 = "datasets/animals/birb.jpg"
# img2 = "datasets/animals/birb2.jpg"
# img3 = "datasets/animals/bork.jpg"
# img4 = "datasets/animals/snek.jpg"
# imgs = [img1, img2, img3, img4]
# return imgs
def get_filenames():
""" Extracts filenames from the dataset/fonts folder. """
# TODO replace with better data reading?
total_filenames = []
root_dir = "dataset/art_dirs/"
for root, dirs, files in os.walk(root_dir):
for file in files:
total_filenames.append(os.path.join(root, file).replace('\\', '/'))
return total_filenames
def get_metadata(filenames):
""" Extracts metadata from filenames and writes them to a .pkl file. """
total_metadata = []
for filename in filenames:
with open(filename) as f:
metadata = json.load(f)
total_metadata.append(metadata)
with open("metadata.json", 'w') as outfile:
json.dump(total_metadata, outfile)
def process_images(model, dataset):
""" Extracts all the feature vectors of a particular dataset using a particular model and writes them to a .pkl file. """
# start = time.time()
total_features = []
model.cuda()
if isinstance(model, ResNet):
layer = model._modules.get('avgpool')
elif isinstance(model, SqueezeNet):
layer = list(model.children())[-1][-1]
model.eval()
print(model.name)
count = 0
for imgs, label in dataset:
try:
feature_vector = get_feature_vector(imgs.cuda(), model, layer)
total_features.extend(feature_vector)
if count % 120 == 0:
print(count)
count += 1
except Exception:
print("error loading: {}".format(label))
total_numpy_features = np.array(total_features)
pickle.dump(total_numpy_features, open("dataset/art_features-" + model.name + ".pkl", "wb"))
# end = time.time()
# print(end - start)
# print(model.name)
def run_models():
""" Extracts the feature vectors from the dataset from a variety of different models. """
dataset = load_dataset(32)
# resnet18 = models.resnet18(pretrained=True)
# resnet34 = models.resnet34(pretrained=True)
# resnet50 = models.resnet50(pretrained=True)
# resnet101 = models.resnet101(pretrained=True)
# resnet152 = models.resnet152(pretrained=True)
# inception = models.inception_v3(pretrained=True)
# googlenet = models.googlenet(pretrained=True)
# shufflenet = models.shufflenet_v2_x1_0(pretrained=True)
# resnext50_32x4d = models.resnext50_32x4d(pretrained=True)
# resnext101_32x8d = models.resnext101_32x8d(pretrained=True)
# wide_resnet101_2 = models.wide_resnet101_2(pretrained=True)
# wide_resnet50_2 = models.wide_resnet50_2(pretrained=True)
# alexnet = models.alexnet(pretrained=True)
squeezenet = models.squeezenet1_1(pretrained=True)
# vgg16 = models.vgg16(pretrained=True)
# densenet = models.densenet161(pretrained=True)
# mobilenet = models.mobilenet_v2(pretrained=True)
# mnasnet = models.mnasnet1_0(pretrained=True)
# resnet18.name = "resnet18"
# resnet34.name = "resnet34"
# resnet50.name = "resnet50"
# resnet101.name = "resnet101"
# resnet152.name = "resnet152"
# inception.name = "inception"
# googlenet.name = "googlenet"
# shufflenet.name = "shufflenet"
# resnext50_32x4d.name = "resnext50_32x4d"
# resnext101_32x8d.name = "resnext101_32x8d"
# wide_resnet101_2.name = "wide_resnet101_2"
# wide_resnet50_2.name = "wide_resnet50_2"
# alexnet.name = "alexnet"
squeezenet.name = "squeezenet"
# vgg16.name = "vgg16"
# densenet.name = "densenet"
# mobilenet.name = "mobilenet"
# mnasnet.name = "mnasnet"
print("Models created")
# # only needs to be run once, commented out for now
# filenames = get_filenames()
# get_metadata(filenames)
# print("Metadata done")
# all_models = [resnet18, alexnet, squeezenet, vgg16, densenet, inception, googlenet,
# shufflenet, mobilenet, resnext50_32x4d, wide_resnet50_2, mnasnet]
all_models = [squeezenet]
for model in all_models:
process_images(model, dataset)
print(model.name + " done")
def benchmark_different():
models = ["datasets/features-resnet18.pkl", "datasets/features-resnet34.pkl", "datasets/features-resnet50.pkl", "datasets/features-resnet101.pkl", "datasets/features-resnext50_32x4d.pkl", "datasets/features-wide_resnet50_2.pkl"]
# models = ["datasets/features-resnext50_32x4d.pkl"]
for model in models:
vectors = pickle.load(open(model, "rb"))
total = 0
for i in range(248):
    for j in range(248):
        total += cosine_similarity(torch.from_numpy(vectors[i]), torch.from_numpy(vectors[j]))
print(model)
print(total / (248 * 248))  # mean similarity over all 248 x 248 pairs
if __name__ == '__main__':
torch.multiprocessing.freeze_support()
run_models()
# print(torch.cuda.is_available())
# benchmark_different()
# print(pickle.load(open("datasets/metadata.pkl", "rb")))
# vectors = pickle.load(open("datasets/features-resnet18.pkl", "rb"))
# print(cosine_similarity(torch.from_numpy(vectors[10]), torch.from_numpy(vectors[11])))
# datasets/features-resnet18.pkl
# tensor([0.8278])
# datasets/features-resnet34.pkl
# tensor([0.8308])
# datasets/features-resnet50.pkl
# tensor([0.8584])
# datasets/features-resnet101.pkl
# tensor([0.8729])
# datasets/features-resnext50_32x4d.pkl
# tensor([0.8269])
# datasets/features-wide_resnet50_2.pkl
# tensor([0.8112])
# datasets/features-resnet18.pkl
# tensor([0.8796])
# datasets/features-resnet34.pkl
# tensor([0.8815])
# datasets/features-resnet50.pkl
# tensor([0.9034])
# datasets/features-resnet101.pkl
# tensor([0.9034])
# datasets/features-resnext50_32x4d.pkl
# tensor([0.8826])
# datasets/features-wide_resnet50_2.pkl
# tensor([0.8732])

View file

@ -1,24 +0,0 @@
"""Given json files in a folder, output a json list of all files in alphabetical order by filename"""
import os
from tqdm import tqdm
import json
FOLDER_PATH = "./dataset/metadata"
total_filenames = []
total_metadata = []
for root, dirs, files in os.walk(FOLDER_PATH):
for file in tqdm(files):
total_filenames.append(os.path.join(root, file).replace('\\', '/'))
for filename in tqdm(total_filenames):
with open(filename, encoding="utf8") as jsonfile:
try:
metadata = json.load(jsonfile)
total_metadata.append(metadata)
except ValueError:  # covers json.JSONDecodeError
print("Error parsing: {}".format(filename))
with open("metadata.json", "w") as outfile:
json.dump(total_metadata, outfile)

View file

@ -1,78 +0,0 @@
import torch
import torch.nn as nn
from torch.utils import data
import torchvision.datasets as datasets
import torchvision.models as models
import torchvision.transforms as transforms
from torch.autograd import Variable
from PIL import Image
def load_dataset(batch_size):
data_path = './dataset'
train_dataset = datasets.ImageFolder(
root=data_path,
transform=transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor(), transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
)
# think about other possible transforms to scale to 224x224? random crop, etc.
train_loader = data.DataLoader(
train_dataset,
batch_size=batch_size,
num_workers=6,
shuffle=False
)
return train_loader
# Load the pretrained model
model = models.resnet18(pretrained=True)
# Use CUDA to utilize the GPU
print(torch.cuda.is_available())
model.cuda()
# Use the model object to select the desired layer
layer = model._modules.get('avgpool')
# Set model to evaluation mode
model.eval()
def image_vectors(imgs):
def vector_size(layer):
for param in layer.parameters():
return param.shape[1]
def copy_data(m, i, o):
my_embedding.copy_(o.data.squeeze())
t_imgs = Variable(imgs)
my_embedding = torch.zeros(imgs.shape[0], vector_size(model._modules.get('fc')))
h = layer.register_forward_hook(copy_data)
model(t_imgs)
h.remove()
return my_embedding
def cosine_similarity(a, b):
# Using PyTorch Cosine Similarity
cos = nn.CosineSimilarity(dim=1, eps=1e-6)
cos_sim = cos(a.unsqueeze(0), b.unsqueeze(0))
# print('\nCosine similarity: {0}\n'.format(cos_sim))
return cos_sim
if __name__ == '__main__':
torch.multiprocessing.freeze_support()
batch_size = 64
dataset = load_dataset(batch_size)
for batch, labels in dataset:
a = image_vectors(batch.cuda()) # features, matrix of dimensions (batch size) x (vector size)
b = labels
# test on dogs and cats
for pair in [(0, 1), (2, 3), (0, 2), (1, 3)]:
print(cosine_similarity(a[pair[0]], a[pair[1]]))

View file

@ -1,15 +0,0 @@
import os
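# the image type is encoded as the third underscore-separated field of the filename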
def extract_image_type(filename):
split_name = int(filename.split("_")[2])
return split_name
base = "C://Users/v-stfu/Documents/GitHub/art/data science/dataset/"
for root, dirs, files in os.walk(base + "art"):
for file in files:
image_type = extract_image_type(file)
if not os.path.exists(base + "art_dirs/" + str(image_type)):
os.mkdir(base + "art_dirs/" + str(image_type))
os.rename("C://Users/v-stfu/Documents/GitHub/art/data science/dataset/art/" + file,
"C://Users/v-stfu/Documents/GitHub/art/data science/dataset/art_dirs/" + str(image_type) + "/" + file)

Binary data
data science/test/cats/cat.jpg

Binary file not shown.

Before

Width:  |  Height:  |  Size: 48 KiB

Binary data
data science/test/cats/kitten.jpg

Binary file not shown.

Before

Width:  |  Height:  |  Size: 69 KiB

Binary data
data science/test/dogs/dog.jpg

Binary file not shown.

Before

Width:  |  Height:  |  Size: 70 KiB

Binary data
data science/test/dogs/doggy.jpg

Binary file not shown.

Before

Width:  |  Height:  |  Size: 2.8 MiB

View file

@ -0,0 +1,23 @@
import os
import pandas as pd
import requests
from tqdm import tqdm
batch_size = 256
img_width = 225
img_height = 225
model = "resnet"
metadata_fn = "metadata.json"
data_dir = "images"
os.makedirs(data_dir, exist_ok=True)
metadata = pd.read_json(metadata_fn, lines=True)
for i, row in tqdm(metadata.iterrows()):
target_file = os.path.join(data_dir, row["id"] + ".jpg")
if not os.path.exists(target_file):
try:
    # fetch first, then write, so a failed request doesn't leave an empty file behind
    content = requests.get(row["Thumbnail_Url"]).content
    with open(target_file, 'wb') as f:
        f.write(content)
except Exception as e:
    print(e)

View file

@ -0,0 +1,128 @@
import os
import pandas as pd
import torch
import torch.nn as nn
from PIL import Image
from torch.utils.data import Dataset
from torchvision import models, transforms
from tqdm import tqdm
import numpy as np
import pickle
data_dir = 'images'
metadata_fn = "metadata.json"
features_dir = "features"
features_file = os.path.join(features_dir, "pytorch_rn50.pkl")
featurize_images = True
device = torch.device("cuda:0")
os.makedirs(data_dir, exist_ok=True)
os.makedirs(features_dir, exist_ok=True)
if featurize_images:
class ArtDataset(Dataset):
"""Face Landmarks dataset."""
def __init__(self, metadata_json, image_dir, transform):
"""
Args:
csv_file (string): Path to the csv file with annotations.
root_dir (string): Directory with all the images.
transform (callable, optional): Optional transform to be applied
on a sample.
"""
self.metadata = pd.read_json(metadata_json, lines=True)
self.image_dir = image_dir
self.transform = transform
def __len__(self):
return len(self.metadata)
def __getitem__(self, idx):
if torch.is_tensor(idx):
idx = idx.tolist()
metadata = self.metadata.iloc[idx]
with open(os.path.join(self.image_dir, metadata["id"] + ".jpg"), "rb") as f:
image = Image.open(f).convert("RGB")
return self.transform(image), metadata["id"]
data_transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
dataset = ArtDataset(metadata_fn, data_dir, data_transform)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=False, num_workers=4)
dataset_size = len(dataset)
# Load a pretrained ResNet-50 to use as the featurizer
model = models.resnet50(pretrained=True)
model.eval()
model.to(device)
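# drop the final fully connected layer; the remaining network outputs the 2048-d pooled features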
cut_model = nn.Sequential(*list(model.children())[:-1])
all_outputs = []
all_ids = []
for i, (inputs, ids) in enumerate(tqdm(data_loader)):
inputs = inputs.to(device)
outputs = torch.squeeze(cut_model(inputs)).detach().cpu().numpy()
all_outputs.append(outputs)
all_ids.append(list(ids))
all_outputs = np.concatenate(all_outputs, axis=0)
all_ids = np.concatenate(all_ids, axis=0)
with open(features_file, "wb+") as f:
pickle.dump((all_outputs, all_ids), f)
with open(features_file, "rb") as f:
with torch.no_grad():
(all_outputs, all_ids) = pickle.load(f)
all_urls = np.array(pd.read_json(metadata_fn, lines=True).loc[:, "Thumbnail_Url"])
features = torch.from_numpy(all_outputs).float().to("cpu:0")
features = features / torch.sqrt(torch.sum(features ** 2, dim=1, keepdim=True))
features = features.to(device)
indices = torch.arange(0, features.shape[0]).to(device)
print("loaded features")
metadata = pd.read_json(metadata_fn, lines=True)
culture_arr = np.array(metadata["Culture"])
cultures = metadata.groupby("Culture").count()["id"].sort_values(ascending=False).index.to_list()
media_arr = np.array(metadata["Classification"])
media = metadata.groupby("Classification").count()["id"].sort_values(ascending=False).index.to_list()
ids = np.array(metadata["id"])
masks = {"culture": {}, "medium": {}}
for culture in cultures:
masks["culture"][culture] = torch.from_numpy(culture_arr == culture).to(device)
for medium in media:
masks["medium"][medium] = torch.from_numpy(media_arr == medium).to(device)
all_matches = []
for i, row in tqdm(metadata.iterrows()):
feature = features[i]
matches = {"culture": {}, "medium": {}}
all_dists = torch.sum(features * feature, dim=1).to(device)
for culture in cultures:
    selected_indices = indices[masks["culture"][culture]]
    k = min(10, selected_indices.shape[0])
    dists, inds = torch.topk(all_dists[selected_indices], k, sorted=True)
    matches["culture"][culture] = ids[selected_indices[inds].cpu().numpy()]
for medium in media:
    selected_indices = indices[masks["medium"][medium]]
    k = min(10, selected_indices.shape[0])
    dists, inds = torch.topk(all_dists[selected_indices], k, sorted=True)
    matches["medium"][medium] = ids[selected_indices[inds].cpu().numpy()]
all_matches.append(matches)
metadata["matches"] = all_matches
metadata.to_json("results/metadata_enriched.json")
print("here")