diff --git a/SampleTimeSeries/Covid-19 India Pred/Trigger_Final(Covid).ipynb b/SampleTimeSeries/Covid-19 India Pred/Trigger_Final(Covid).ipynb deleted file mode 100644 index 2a8ec0d..0000000 --- a/SampleTimeSeries/Covid-19 India Pred/Trigger_Final(Covid).ipynb +++ /dev/null @@ -1 +0,0 @@ -{"cells":[{"cell_type":"markdown","source":["**PROBLEM STATEMENT**\n
Predict the active COVID-19 cases for 2021 based on 2020 state-wise data for India. \n
Sample data source: https://www.kaggle.com/aritranandi23/covid-19-analysis-and-prediction/data\n
\n
**COLUMN DEFINITION**\n
Date:string\n
Time:string\n
State/UnionTerritory:string\n
ConfirmedIndianNational:string\n
ConfirmedForeignNational:string\n
Cured:integer\n
Deaths:integer\n
Confirmed:integer\n\n
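A minimal sketch of applying the column definition above with pandas, assuming a local copy of the Kaggle file `covid_19_india.csv` (the notebook itself reads the same file from ADLS via Spark and does its casting there):

```python
import pandas as pd

# Hypothetical local copy of the Kaggle file; the notebook reads it from ADLS instead.
df = pd.read_csv("covid_19_india.csv")

# Counts become integers; the remaining columns stay strings, per the column definition.
for col in ["Cured", "Deaths", "Confirmed"]:
    df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)
for col in ["Date", "Time", "State/UnionTerritory",
            "ConfirmedIndianNational", "ConfirmedForeignNational"]:
    df[col] = df[col].astype(str)

print(df.dtypes)
```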
**STEPS IN MODELLING**\n
1.Data Acquisition\n
2.Data understanding\n
3.Data visualisation/EDA\n
4.Data cleaning/missing imputation/typecasting\n
5.Sampling/ bias removal\n
6.Anomaly detection\n
7.Feature selection/importance\n
8.Azure ML Model trigger\n
9.Model Interpretation\n
10.Telemetry\n
\n
\n
**FEATURE ENGINEERING**\n
1.Derive the death rate, discharge rate and active-case rate from the Confirmed, Cured and Deaths counts.\n
2.In TS Forecasting each group must provide atleast 3 datapoints to obtain frequency, remove the records where frequency<=3 for the train set i.e. data from the year 2020"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"1b17cc37-22a5-4d5a-8bb5-dea3155d53b5"}}},{"cell_type":"markdown","source":["## Import functions from Master Notebook:\nImport the Functions and dependencies from the Master notebook to be used in the Trigger Notebook"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"5af69aec-6302-4466-a9f7-311d01f8e490"}}},{"cell_type":"code","source":["%run /Users/.../AMLMasterNotebook"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"5fb62253-5520-4e99-927b-37f012518930"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"markdown","source":["## 1.Data Acquisition\n1.Acquisition of data from datasource ADLS path in CSV/Parquet/JSON etc format.\n
2.Apply logical transformations to the data.\n
3.Transforming columns into required datatypes, converting to pandas df, persisiting actual dataset, intoducing a column 'Index' to assign a unique identifier to each dataset row so that this canm be used to retrieve back the original form after any data manupulations."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"762377e8-f5f6-4108-8bfb-c3ac39a33503"}}},{"cell_type":"code","source":["%scala\n//\n\nval filepath1= \"adl://.azuredatalakestore.net/Temp/ML-PJC/covid_19_india.csv\"\nvar df1=spark.read.format(\"com.databricks.spark.csv\").option(\"inferSchema\", \"true\").option(\"header\", \"true\").option(\"delimiter\", \",\").load(filepath1)\ndf1.createOrReplaceTempView(\"CovidIndia\")\n\n\n\nval filepath2= \"adl://.azuredatalakestore.net/Temp/ML-PJC/covid_vaccine_statewise.csv\"\nvar df2=spark.read.format(\"com.databricks.spark.csv\").option(\"inferSchema\", \"true\").option(\"header\", \"true\").option(\"delimiter\", \",\").load(filepath2)\ndf2.createOrReplaceTempView(\"Vaccine\")\n\nval filepath3= \"adl://.azuredatalakestore.net/Temp/ML-PJC/StatewiseTestingDetails.csv\"\nvar df3=spark.read.format(\"com.databricks.spark.csv\").option(\"inferSchema\", \"true\").option(\"header\", \"true\").option(\"delimiter\", \",\").load(filepath3)\ndf3.createOrReplaceTempView(\"Testing\")\n\n"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Read Data from Lake","showTitle":true,"inputWidgets":{},"nuid":"6c2e9649-413e-4b80-b891-8ec67aa3c086"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"code","source":["%sql\nwith CTE1 as\n(select\nMAX(date_format(Date, 'MMMM')) AS Month\n,MAX(Year(Date)) AS Year\n,CONCAT(Year(Date),'-',RIGHT(CONCAT('00',MONTH(Date)),2),'-','01') AS Date\n,`State/UnionTerritory` as State\n,SUM(Cured) as Cured\n,SUM(Deaths) as Deaths\n,SUM(Confirmed) as Confirmed\n,((SUM(Confirmed)-SUM(Deaths)-SUM(Cured))/SUM(Confirmed) * 100) as ActiveCasesRate\n,(SUM(Cured)/SUM(Confirmed) * 100) AS DischargeRate\n,(SUM(Deaths)/SUM(Confirmed) * 100) as DeathsRate\nfrom CovidIndia C\ngroup by \nCONCAT(Year(Date),'-',RIGHT(CONCAT('00',MONTH(Date)),2),'-','01')\n,`State/UnionTerritory`\n)\n\n--select distinct state from CTE1 group by State having count(*)<=3\nselect * from CTE1 where State not in (select distinct state from CTE1 group by State having count(*)<=3)\nand Year=2021"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"9ea2ad72-bf15-4b22-9358-927ee6af323e"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"code","source":["df= spark.sql(\"\"\"\nwith CTE1 as\n(select\nMAX(date_format(Date, 'MMMM')) AS Month\n,MAX(Year(Date)) AS Year\n,CONCAT(Year(Date),'-',RIGHT(CONCAT('00',MONTH(Date)),2),'-','01') AS Date\n,`State/UnionTerritory` as State\n,SUM(Cured) as Cured\n,SUM(Deaths) as Deaths\n,SUM(Confirmed) as Confirmed\n,((SUM(Confirmed)-SUM(Deaths)-SUM(Cured))/SUM(Confirmed) * 100) as ActiveCasesRate\n,(SUM(Cured)/SUM(Confirmed) * 100) AS DischargeRate\n,(SUM(Deaths)/SUM(Confirmed) * 100) as DeathsRate\nfrom 
CovidIndia C\ngroup by \nCONCAT(Year(Date),'-',RIGHT(CONCAT('00',MONTH(Date)),2),'-','01')\n,`State/UnionTerritory`\n)\n\nselect * from CTE1 \nwhere State not like 'Lakshadweep' and --Low Frequency for train as only one month dec in 2020 train\nState not in (select distinct state from CTE1 group by State having count(*)<=3) --low frequency for train as atleast three months required \"\"\")"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Logical transformation of data","showTitle":true,"inputWidgets":{},"nuid":"dda179bc-7139-4f6b-b493-16c174561500"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"code","source":["import pandas as pd\nimport numpy as np\nfrom pyspark.sql.functions import col\n##df.select(*(col(c).cast(\"float\").alias(c) for c in df.columns))\n#cols=df.columns\n#cols.remove('Index')\n#for col_name in cols:\n# df = df.withColumn(col_name, col(col_name).cast('float'))\n#for col_name in ['Index']:\n# df = df.withColumn(col_name, col(col_name).cast('Int')) \n\n# \ncols_all=[\n'Month'\n,'Year'\n,'Date'\n,'State'\n,'Cured'\n,'Deaths'\n,'Confirmed'\n,'ActiveCasesRate'\n,'DischargeRate'\n,'DeathsRate'\n]\ncols_string=[\n'Month'\n,'Year'\n,'Date'\n,'State'\n]\ncols_int=[\n'Cured'\n,'Deaths'\n,'Confirmed'\n]\ncols_bool=[]\ncols_Float=[\n'ActiveCasesRate'\n,'DischargeRate'\n,'DeathsRate'\n]\nfor col_name in cols_int:\n df = df.withColumn(col_name, col(col_name).cast('Int')) \nfor col_name in cols_Float:\n df = df.withColumn(col_name, col(col_name).cast('float')) \nfor col_name in cols_bool:\n df = df.withColumn(col_name, col(col_name).cast('bool')) \n \ninput_dataframe = df.toPandas()\ninput_dataframe['Index'] = np.arange(len(input_dataframe))\noutdir = '/dbfs/FileStore/Covid.csv'\ninput_dataframe.to_csv(outdir, index=False)\n#input_dataframe = pd.read_csv(\"/dbfs/FileStore/Dataframe.csv\", header='infer')\n"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Data columns and structural transformation","showTitle":true,"inputWidgets":{},"nuid":"96552a0b-6af9-42f9-bdfe-8accec92b64a"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"markdown","source":["## 2.Data Exploration\n1.Exploratory Data Analysis (EDA)- To understand the overall data at hand, analysing each feature independently for its' statistics, the correlation and interraction between variables, data sample etc. \n
2.Data Profiling Plots- To analyse the Categorical and Numerical columns separately for any trend in data, biasness in data etc."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"3271e3ed-4236-4190-8565-719d83710eb2"}}},{"cell_type":"code","source":["input_dataframe = pd.read_csv(\"/dbfs/FileStore/Covid.csv\", header='infer')\n\np=Data_Profiling_viaPandasProfiling(input_dataframe,'RealEstate','EDA')\ndisplayHTML(p)\n"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"EDA","showTitle":true,"inputWidgets":{},"nuid":"29b65c27-980f-4981-bf45-0b6e124a92c2"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"code","source":["input_dataframe = pd.read_csv(\"/dbfs/FileStore/Covid.csv\", header='infer')\n\n#User Inputs\ncols_all=[\n'Month'\n,'Year'\n,'Date'\n,'State'\n,'Cured'\n,'Deaths'\n,'Confirmed'\n,'ActiveCasesRate'\n,'DischargeRate'\n,'DeathsRate'\n]\nCategorical_cols=['Month'\n,'Year'\n,'Date'\n,'State'\n]\nNumeric_cols=['Cured'\n,'Deaths'\n,'Confirmed'\n,'ActiveCasesRate'\n,'DischargeRate'\n,'DeathsRate'\n]\nLabel_col='ActiveCasesRate'\n\n#Data_Profiling_Plots(input_dataframe,Categorical_cols,Numeric_cols,Label_col)\nData_Profiling_Plots(input_dataframe,Categorical_cols,Numeric_cols,Label_col)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Data Profiling Plots","showTitle":true,"inputWidgets":{},"nuid":"ac56b8e6-d004-48fe-a01c-bc9c2e9fcf9a"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"markdown","source":["## 4.Cleansing\nTo clean the data from NULL values, fix structural errors in columns, drop empty columns, encode the categorical values, normalise the data to bring to the same scale. We also check the Data Distribution via Correlation heatmap of original input dataset v/s the Cleansed dataset to validate whether or not the transformations hampered the original data trend/density."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"205b31e4-67b0-4b5b-8074-3df162f9fe1f"}}},{"cell_type":"code","source":["subsample_final = pd.read_csv(\"/dbfs/FileStore/Covid.csv\", header='infer')\n#subsample_final=subsample_final.drop(['Index'], axis = 1) # Index is highest variability column hence always imp along PC but has no business value. 
You can append columns to be dropped by your choice here in the list\n\ninputdf_new=autodatacleaner(subsample_final,\"/dbfs/FileStore/Covid.csv\",\"Covid\",\"Data Cleanser\")\nprint(\"Total rows in the new pandas dataframe:\",len(inputdf_new.index))\n\n#persist cleansed data sets \nfilepath1 = '/dbfs/FileStore/Cleansed_Covid.csv'\ninputdf_new.to_csv(filepath1, index=False)\n"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Auto Cleanser ","showTitle":true,"inputWidgets":{},"nuid":"08f53513-9939-462b-bd37-97cdec8f8349"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"code","source":["subsample_final = pd.read_csv(\"/dbfs/FileStore/Covid.csv\", header='infer')\n\ndisplay(Data_Profiling_Fin(subsample_final))"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Data profiling(Heatmap correlation)- User input dataframe","showTitle":true,"inputWidgets":{},"nuid":"54c54d7e-6749-47e7-a2cd-c6318255994e"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"code","source":["Cleansed=pd.read_csv(\"/dbfs/FileStore/Cleansed_Covid.csv\", header='infer')\n\ndisplay(Data_Profiling_Fin(Cleansed))"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Data profiling(Heatmap correlation)- Cleansed dataframe","showTitle":true,"inputWidgets":{},"nuid":"636d95dd-73ee-46f4-a0b3-f794b094b4ff"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"markdown","source":["## 5.Anomaly Detection\nIterate data over various Anomaly-detection techniques and estimate the number of Inliers and Outliers for each."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"667e758b-1cb9-4f3b-84f0-209cce9f20f0"}}},{"cell_type":"code","source":["import pandas as pd\nimport numpy as np\nfrom scipy import stats\nimport matplotlib.pyplot as plt\n%matplotlib inline\nimport matplotlib.font_manager\nfrom pyod.models.abod import ABOD\nfrom pyod.models.knn import KNN\nfrom pyod.utils.data import generate_data, get_outliers_inliers\nfrom pyod.models.cblof import CBLOF\n#from pyod.models.feature_bagging import FeatureBagging\nfrom pyod.models.hbos import HBOS\nfrom pyod.models.iforest import IForest\nfrom pyod.models.lof import LOF\nfrom sklearn.preprocessing import MinMaxScaler\nimport warnings\nfrom io import BytesIO\nfrom pyspark.sql.functions import base64\nfrom pyspark.sql.functions import unbase64\nwarnings.filterwarnings(\"ignore\")\noutliers_fraction = 0.05\n#df = pd.read_csv(\"/kaggle/input/house-prices-advanced-regression-techniques/train.csv\")\n#target_variable = 'SalePrice'\n#variables_to_analyze = '1stFlrSF'\noutput_path = '/dbfs/FileStore/AnomalyDetection_HTML'\n#df.plot.scatter('1stFlrSF','SalePrice')\ndef AnomalyDetection(df,target_variable,variables_to_analyze,outliers_fraction,input_appname,task_type):\n import time\n from datetime import date\n today = 
date.today()\n ts = int(time.time())\n appname = input_appname\n appnamequotes = \"'%s'\" % appname\n tsquotes = \"'%s'\" % str(ts)\n task = \"'%s'\" % str(task_type)\n \n #Scale the data is required to create a explainable visualization (it will become way too stretched otherwise)\n scaler = MinMaxScaler(feature_range=(0, 1))\n df[[target_variable,variables_to_analyze]] = scaler.fit_transform(df[[target_variable,variables_to_analyze]])\n X1 = df[variables_to_analyze].values.reshape(-1,1)\n X2 = df[target_variable].values.reshape(-1,1)\n X = np.concatenate((X1,X2),axis=1)\n random_state = np.random.RandomState(42)\n # Define seven outlier detection tools to be compared\n classifiers = {\n 'Angle-based Outlier Detector (ABOD)': ABOD(contamination=outliers_fraction),\n 'Cluster-based Local Outlier Factor (CBLOF)':CBLOF(contamination=outliers_fraction,check_estimator=False, random_state=random_state),\n #'Feature Bagging':FeatureBagging(LOF(n_neighbors=35),contamination=outliers_fraction,check_estimator=False,random_state=random_state),\n 'Histogram-base Outlier Detection (HBOS)': HBOS(contamination=outliers_fraction),\n 'Isolation Forest': IForest(contamination=outliers_fraction,random_state=random_state),\n 'K Nearest Neighbors (KNN)': KNN(contamination=outliers_fraction),\n 'Average KNN': KNN(method='mean',contamination=outliers_fraction)\n }\n xx , yy = np.meshgrid(np.linspace(0,1 , 200), np.linspace(0, 1, 200))\n for i, (clf_name, clf) in enumerate(classifiers.items()):\n clf.fit(X)\n # predict raw anomaly score\n scores_pred = clf.decision_function(X) * -1\n # prediction of a datapoint category outlier or inlier\n y_pred = clf.predict(X)\n n_inliers = len(y_pred) - np.count_nonzero(y_pred)\n n_outliers = np.count_nonzero(y_pred == 1)\n X2\n # copy of dataframe\n dfx = df\n dfy=df\n dfx['outlier'] = y_pred.tolist()\n dfy['outlier'] = y_pred.tolist()\n dfy['scores_pred'] = scores_pred.tolist()\n dfy[target_variable] = df[target_variable]\n \n\n clf_name_string=\"%s\" % str(clf_name)\n ts_string=\"%s\" % str(ts)\n #OutputfileName=\"adl://psinsightsadlsdev01.azuredatalakestore.net/DEV/AnomalyDetection_\"+clf_name_string +\".csv\"\n #copydbfs = '/dbfs/FileStore/AnomalyDetection.csv'\n #dfy.to_csv(copydbfs, index=False)\n #dbutils.fs.cp (\"/FileStore/AnomalyDetection.csv\", OutputfileName, True) \n n_outliers=\"%s\" % str(n_outliers)\n n_inliers=\"%s\" % str(n_inliers)\n rm_str3 = \"Insert into AutoTuneML.amltelemetry values (\" + appnamequotes + \",\"+ task + \",'OUTLIERS :\" + n_outliers + \" INLIERS :\" + n_inliers + \" :- \" + clf_name+ \"',\" + tsquotes + \")\"\n #spark.sql(rm_str3)\n is_outlier = dfy['outlier']==1\n Outlier_data = dfy[is_outlier]\n html_data = Outlier_data.to_html(classes='table table-striped')\n # IX1 - inlier feature 1, IX2 - inlier feature 2\n IX1 = np.array(dfx[variables_to_analyze][dfx['outlier'] == 0]).reshape(-1,1)\n IX2 = np.array(dfx[target_variable][dfx['outlier'] == 0]).reshape(-1,1)\n # OX1 - outlier feature 1, OX2 - outlier feature 2\n OX1 = dfx[variables_to_analyze][dfx['outlier'] == 1].values.reshape(-1,1)\n OX2 = dfx[target_variable][dfx['outlier'] == 1].values.reshape(-1,1) \n print('OUTLIERS : ',n_outliers,'INLIERS : ',n_inliers, clf_name)\n # threshold value to consider a datapoint inlier or outlier\n threshold = stats.scoreatpercentile(scores_pred,100 * outliers_fraction)\n # decision function calculates the raw anomaly score for every point\n Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1\n Z = Z.reshape(xx.shape)\n 
plt.figure(figsize=(10, 10))\n # fill blue map colormap from minimum anomaly score to threshold value\n plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),cmap=plt.cm.Blues_r)\n # draw red contour line where anomaly score is equal to thresold\n a = plt.contour(xx, yy, Z, levels=[threshold],linewidths=2, colors='red')\n # fill orange contour lines where range of anomaly score is from threshold to maximum anomaly score\n plt.contourf(xx, yy, Z, levels=[threshold, Z.max()],colors='orange')\n b = plt.scatter(IX1,IX2, c='white',s=20, edgecolor='k')\n c = plt.scatter(OX1,OX2, c='black',s=20, edgecolor='k')\n plt.axis('tight') \n # loc=2 is used for the top left corner \n plt.legend(\n [a.collections[0], b,c],\n ['learned decision function', 'inliers','outliers'],\n prop=matplotlib.font_manager.FontProperties(size=20),\n loc=2)\n plt.xlim((0, 1))\n plt.ylim((0, 1))\n plt.title(clf_name)\n #tmpfile = BytesIO()\n #plt.savefig(tmpfile, format='png')\n #plt.savefig('/dbfs/FileStore/figure.png')\n plt.show() \n # encoded = base64.b64encode(tmpfile.getvalue()).decode('utf-8')\n # print(\"done2\")\n #text = 'OUTLIERS : '+ str(n_outliers)+', INLIERS : '+str(n_inliers)\n #clf_text = clf_name\n #output_file = \"adl://psinsightsadlsdev01.azuredatalakestore.net/DEV/AnomalyDetection_chart\" + clf_text + '.html'\n #html = '
<h2>{clf_text}</h2><p>{text}</p><img src=\"figure.png\" alt=\"Plot\">
'\n #print(html)\n \n #print(html2)\n #html3 = html2+html_data\n #s = Template(html).safe_substitute(clf_text=clf_text)\n #t = Template(s).safe_substitute(text=text)\n #print(t)\n #dbutils.fs.put(\"/dbfs/FileStore/anamolydetection.html\", \"Contents of my file\")\n #dbutils.fs.cp (\"/dbfs/FileStore/anamolydetection.html\", output_file, True)\n #print(html3)\n #with open(output_file,'w') as f:\n # f.write(t)\n \n #filepath=\"adl://psinsightsadlsdev01.azuredatalakestore.net/DEV/AnomalyDetection.html\"\n ##plt.savefig(tmpfile, format='png')\n #plt.savefig('/dbfs/FileStore/AnomalyDetection.png')\n #dbutils.fs.cp (\"/FileStore/AnomalyDetection.png\", filepath, True)\n #print(\"Anomaly Detection Report can be downloaded from path: \",filepath)\n"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"70b92870-9a8b-413d-83e9-6b9f2dc6bd81"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"code","source":["#Calling the Anamoly Detection Function for identifying outliers \noutliers_fraction = 0.05\ndf =pd.read_csv(\"/dbfs/FileStore/Cleansed_Covid.csv\", header='infer')\ntarget_variable = 'ActiveCasesRate'\nvariables_to_analyze='Confirmed'\n\nAnomalyDetection(df,target_variable,variables_to_analyze,outliers_fraction,'anomaly_test','test')"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Anomaly Detection","showTitle":false,"inputWidgets":{},"nuid":"93afcdce-1c38-4cbc-8560-21d2fb695e38"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"markdown","source":["## 6.Feature Selection\nPerform feature selection on the basis of Feature Importance ranking, correlation values, variance within the column.\nChoose features with High Importance value score, drop one of the two highly correlated features, drop features which offer zero variability to data and thus do not increase the entropy of dataset."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"01e0c7fc-2882-433d-bdf3-b60158f43303"}}},{"cell_type":"code","source":["df =pd.read_csv(\"/dbfs/FileStore/Cleansed_Covid.csv\", header='infer')\nFeatureSelection(df,'ActiveCasesRate','Continuous',\"/dbfs/FileStore/Cleansed_Covid.csv\",'Covid','FeatureSelection')"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"befb37b1-f0f6-4ce5-86e7-29d7def14c09"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"code","source":["%pip install azureml-train-automl-runtime\n%pip install azureml-automl-runtime\n%pip install azureml-widgets\n%pip install azureml-sdk[automl]\n%pip install ipywidgets\n%pip install pandas-profiling\n%pip install pyod\n%pip install azureml-sdk\n%pip install azureml-explain-model\n%pip install imbalanced-learn\n%pip install pyod\n%pip install skfeature-chappers\n%pip install 
raiwidgets \n\n%pip install ruamel.yaml==0.16.10\n%pip install azure-core==1.8.0\n%pip install liac-arff==2.4.0\n%pip install msal==1.4.3\n%pip install msrest==0.6.18\n%pip install ruamel.yaml.clib==0.2.0\n%pip install tqdm==4.49.0\n%pip install zipp==3.2.0\n%pip install interpret-community==0.15.0\n%pip install azure-identity==1.4.0\n%pip install dotnetcore2==2.1.16\n%pip install jinja2==2.11.2\n%pip install azure-core==1.15.0\n%pip install azure-mgmt-containerregistry==8.0.0\n%pip install azure-mgmt-core==1.2.2\n%pip install distro==1.5.0\n%pip install google-api-core==1.30.0\n%pip install google-auth==1.32.1\n%pip install importlib-metadata==4.6.0\n%pip install msal==1.12.0\n%pip install packaging==20.9\n%pip install pathspec==0.8.1\n%pip install requests==2.25.1\n%pip install ruamel.yaml.clib==0.2.4\n%pip install tqdm==4.61.1\n%pip install zipp==3.4.1\n%pip install scipy==1.5.2\n%pip install charset-normalizer==2.0.3\n%pip install websocket-client==1.1.0\n%pip install scikit-learn==0.22.1\n%pip install interpret-community==0.19.0\n%pip install cryptography==3.4.7\n%pip install llvmlite==0.36.0\n%pip install numba==0.53.1"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"e94c1d2f-6f2c-4ca9-960e-5a98c3e1d6d0"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"markdown","source":["## 7.Auto ML Trigger - after preprocessing\nTrigger Azure auto ML, pick the best model so obtained and use it to predict the label column. Calculate the Weighted Absolute Accuracy amd push to telemetry. also obtain the data back in original format by using the unique identifier of each row 'Index' and report Actual v/s Predicted Columns. 
We also provide the direct link to the azure Portal Run for the current experiment for users to follow."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"40323805-a63a-4500-929c-24297b5edca2"}}},{"cell_type":"code","source":["##df has just index,y actual, y predicted cols, as rest all cols are encoded after manipulation\nimport pandas as pd\ndf =pd.read_csv(\"/dbfs/FileStore/Cleansed_Covid.csv\", header='infer')\nfor col in df.columns:\n if col not in [\"Index\"]: \n df.drop([col], axis=1, inplace=True)\n \n#dataframe is the actual input dataset \ndataframe = pd.read_csv(\"/dbfs/FileStore/Covid.csv\", header='infer')\n\n#Merging Actual Input dataframe with AML output df using Index column\ndataframe_fin = pd.merge(left=dataframe, right=df, left_on='Index', right_on='Index')\n#dataframe_fin\n\n#train-test split\ntrain_data=dataframe_fin[dataframe_fin['Year']==2020]\ntest_data=dataframe_fin[dataframe_fin['Year']==2021]\nlabel='DeathsRate'#'ActiveCasesRate'\ntest_labels = test_data.pop(label).values\ntrain_data"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Train-Test Split","showTitle":true,"inputWidgets":{},"nuid":"cecbfbf6-e795-4938-9722-2708aa6e8277"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"code","source":["#train_date duplicate check\ntrain_data[train_data.duplicated(['State','Date'])]\n"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"7831b400-8107-4813-8cd4-82652261a26c"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"code","source":["#frequency <=3 check\ndf_new= train_data.groupby(['State']).count()\nfreq = df_new[(df_new['Index'] <= 3)]\nfreq"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"063b5d46-9258-412b-b8e2-d5c951f715dd"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"code","source":["##2 removed lakshdweep\n#frequency <=3 check\ndf_new= train_data.groupby(['State']).count()\nfreq = df_new[(df_new['Index'] <= 3)]\nfreq"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"739fe95c-ba2b-44a8-a771-d404e32ff446"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"code","source":["time_series_settings = {\n \"time_column_name\": \"Date\",\n \"grain_column_names\": [\"State\"],\n \"max_horizon\": 2,\n \"target_lags\": 2,\n \"target_rolling_window_size\": 2,\n \"featurization\": \"auto\",\n \"short_series_handling_configuration\":'auto',\n \"freq\": 'MS',\n \"short_series_handling_config\": 
\"auto\"\n}\n\nfrom azureml.core.workspace import Workspace\nfrom azureml.core.experiment import Experiment\nfrom azureml.train.automl import AutoMLConfig\nimport logging\nfrom azureml.core.compute import ComputeTarget, AmlCompute\nfrom azureml.core.compute_target import ComputeTargetException\nfrom azureml.core.experiment import Experiment\nfrom azureml.core import Workspace\nfrom azureml.core.authentication import ServicePrincipalAuthentication\nfrom azureml.core.dataset import Dataset\nfrom azureml.widgets import RunDetails\nfrom azureml.core import Dataset, Datastore\nfrom azureml.data.datapath import DataPath\nfrom sklearn.metrics import confusion_matrix \nfrom sklearn.metrics import accuracy_score \nfrom sklearn.metrics import classification_report\nimport os\nimport warnings\nfrom sklearn.metrics import mean_squared_error\nfrom math import sqrt\nwarnings.filterwarnings('ignore')\n\nautoml_config = AutoMLConfig(task='forecasting',\n primary_metric='normalized_root_mean_squared_error',\n iterations= 1,\n experiment_timeout_minutes=15,\n enable_early_stopping=True,\n n_cross_validations=2,\n training_data=train_data,\n label_column_name=label,\n enable_ensembling=False,\n verbosity=logging.INFO,\n **time_series_settings)\n\n\nws = Workspace(subscription_id = '', resource_group = '', workspace_name = '')\n#ws = Workspace.from_config()\n\n# Verify that cluster does not exist already\ncompute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D12_V2',\n max_nodes=100)\ncompute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)\ncompute_target.wait_for_completion(show_output=True)\n \ndatastore = ws.get_default_datastore()\ntrain_dataset = Dataset.Tabular.register_pandas_dataframe(train_data,datastore,'Covid')\ntest_dataset = Dataset.Tabular.register_pandas_dataframe(test_data,datastore,'Covid')\n \nexperiment = Experiment(ws, \"TS_forecasting\")\nremote_run = experiment.submit(automl_config, show_output=True)\nremote_run.wait_for_completion()\nbest_run, fitted_model = remote_run.get_output()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Auto ML Run","showTitle":true,"inputWidgets":{},"nuid":"43a93019-b057-48b0-9fed-cfad38e9d1fc"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"code","source":["y_predictions, X_trans = fitted_model.forecast(test_data)\ny_predictions"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Forecasting for Test set","showTitle":true,"inputWidgets":{},"nuid":"cee1e13c-3558-4e28-9524-fe7b41ec42b2"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"code","source":["result = test_data\ny_predictions, X_trans = fitted_model.forecast(test_data)\nresult['Values_pred']=y_predictions\nresult['Values_actual']=test_labels\nresult['Error']=result['Values_actual']-result['Values_pred']\nresult['Percentage_change'] = ((result['Values_actual']-result['Values_pred']) / result['Values_actual'] )* 100\nresult=result.reset_index(drop=True)\nresult"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Data Actuals v/s 
Predicted","showTitle":true,"inputWidgets":{},"nuid":"09bc5959-d6f2-4d0a-803b-0cac976e48ea"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0},{"cell_type":"code","source":["y_actual = test_labels\nsum_actuals = sum_errors = 0\nfor actual_val, predict_val in zip(y_actual, y_predictions):\n abs_error = actual_val - predict_val\n if abs_error < 0:\n abs_error = abs_error * -1\n sum_errors = sum_errors + abs_error\n sum_actuals = sum_actuals + actual_val\n\nmean_abs_percent_error = sum_errors / sum_actuals\nAccuracy_score = 1 - mean_abs_percent_error\nprint(Accuracy_score)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Accuracy Calculation","showTitle":true,"inputWidgets":{},"nuid":"84e39f2b-05f2-4c22-ad18-fc35e1da37b8"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"data":"","errorSummary":"","metadata":{},"errorTraceType":null,"type":"ipynbError","arguments":{}}},"output_type":"display_data","data":{"text/html":[""]}}],"execution_count":0}],"metadata":{"application/vnd.databricks.v1+notebook":{"notebookName":"Trigger_Final(Covid) (1)","dashboards":[],"notebookMetadata":{"pythonIndentUnit":2},"language":"python","widgets":{},"notebookOrigID":3621682492188219}},"nbformat":4,"nbformat_minor":0}