Data sampling selects, manipulates, and analyzes a representative subset of data points in order to identify patterns and trends in the larger dataset being examined. The subset obtained is a weighted sample of the actual dataset: it is meant to retain the overall data density and distribution, so that it gives a clear picture of the full dataset while improving performance. The method described here draws samples from the original input data using several techniques and suggests the best resulting sample to the user. The function ‘Sampling’ encompasses all of these features, as explained below.
Get the ideal sample size from the original input dataset using Slovin's formula
$$n = \frac{N}{1 + N e^{2}}$$
Here,
n = number of samples
N = total population
e = error tolerance (level) = 1 − confidence level (for example, e = 0.05 for a 95% confidence level)
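As a minimal sketch of the sample-size step, the formula can be evaluated directly. The function name `slovin_sample_size` and the choice to round up to a whole record are assumptions made for illustration, not details taken from the ‘Sampling’ function.

```python
import math


def slovin_sample_size(population_size: int, error_tolerance: float = 0.05) -> int:
    """Return n = N / (1 + N * e^2), rounded up to a whole record (assumed rounding)."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 < error_tolerance < 1:
        raise ValueError("error_tolerance must be between 0 and 1")
    n = population_size / (1 + population_size * error_tolerance ** 2)
    return math.ceil(n)


# N = 10,000 records at a 95% confidence level (e = 0.05)
print(slovin_sample_size(10_000))  # 385
```

Note that as N grows, n approaches 1/e^2 (400 for e = 0.05), so the suggested sample size stays small even for very large datasets.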
Random Sampling
Pick n items at random from the whole actual dataset of N items, assuming every item has an equal probability (1/N on any single draw) of being selected, irrespective of its weight in the actual dataset.
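A minimal sketch of this step, assuming the records fit in an in-memory list and that sampling is done without replacement (the text does not say which). The helper name `random_sample` and the `seed` parameter are hypothetical and only there to make the example reproducible.

```python
import random


def random_sample(records, n, seed=None):
    """Draw n records uniformly at random, without replacement."""
    if not 0 <= n <= len(records):
        raise ValueError("n must be between 0 and the number of records")
    rng = random.Random(seed)  # local generator, so the global state is untouched
    return rng.sample(records, n)


population = list(range(1_000))  # stand-in for the actual dataset
sample = random_sample(population, 10, seed=42)
print(len(sample))  # 10
```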
Systematic Sampling
This method chooses the sample members of a population at regular intervals. It requires a starting point and a sample size; the interval between picks then follows from these two values. Because the selection pattern is fixed in advance, this is typically among the least time-consuming sampling techniques. Pick every kth item from the actual dataset, where k = N/n (rounded down to a whole number).
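The interval-based selection can be sketched as follows. Choosing the starting point at random within the first interval is an assumption (the text only says a starting point must be selected), and `systematic_sample` is a hypothetical name.

```python
import random


def systematic_sample(records, n, seed=None):
    """Pick every k-th record, with k = N // n and a random start in [0, k)."""
    total = len(records)
    if not 0 < n <= total:
        raise ValueError("n must be between 1 and the number of records")
    k = total // n                             # sampling interval
    start = random.Random(seed).randrange(k)   # assumed: random starting point
    return [records[start + i * k] for i in range(n)]


population = list(range(100))
sample = systematic_sample(population, 10, seed=1)
print(len(sample))  # 10
```

Because the largest index used is start + (n − 1)·k, which is at most N − 1, the selection never runs past the end of the dataset.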
Stratified Sampling
Clustering :- Classify the input data into k clusters using K-means clustering and add an extra column, ‘Cluster’, to the data frame to identify which cluster each record belongs to (0 to k−1). Choose the ideal k for a dataset using the silhouette score. The silhouette coefficient of a data point measures how well it fits within its own cluster and how far it is from the other clusters. A silhouette coefficient close to 1 means the data point is in an appropriate cluster, and a coefficient close to −1 implies that it is in the wrong cluster. In other words, compute the score for a range of values of k and choose the k that gives the highest silhouette score.
Weighted Count :- Get the percentage of records belonging to each cluster in the actual data frame, then create a weighted subsample of n records that maintains the same proportion of records from each cluster.
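The weighted-count step can be sketched on its own, assuming the cluster labels have already been computed (for example by K-means) and are passed in alongside the records. The name `stratified_sample` and the largest-remainder rounding used to make the per-cluster counts add up to exactly n are illustrative choices, not details from the original method.

```python
import random
from collections import defaultdict


def stratified_sample(records, labels, n, seed=None):
    """Draw n records so each cluster keeps its share of the full dataset."""
    if not 0 < n <= len(records):
        raise ValueError("n must be between 1 and the number of records")
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record, label in zip(records, labels):
        groups[label].append(record)

    total = len(records)
    # Ideal (fractional) number of records per cluster.
    quotas = {label: n * len(members) / total for label, members in groups.items()}
    counts = {label: int(quota) for label, quota in quotas.items()}
    # Hand out the records lost to rounding, largest fractional part first.
    leftover = n - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda lb: quotas[lb] - counts[lb], reverse=True)
    for label in by_remainder[:leftover]:
        counts[label] += 1

    sample = []
    for label, members in groups.items():
        sample.extend((record, label) for record in rng.sample(members, counts[label]))
    return sample


records = list(range(100))
labels = [0] * 60 + [1] * 30 + [2] * 10   # hypothetical 'Cluster' column
sample = stratified_sample(records, labels, 20, seed=0)
print(len(sample))  # 20
```

With a 60/30/10 split and n = 20, the subsample holds 12, 6, and 2 records from the three clusters, matching the original proportions.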
Clustered Sampling
If the input data already has a predefined distribution across classes, check whether that distribution is biased towards one or more classes. If it is, apply SMOTE (Synthetic Minority Oversampling Technique) to level the distribution across classes. One approach to addressing imbalanced datasets is to oversample the minority class. The simplest way to do this is to duplicate examples in the minority class, but duplicates add no new information to the model. SMOTE instead synthesizes new examples from the existing ones. Finally, create a weighted subsample of n records that maintains the same weighted distribution of records from each class (after SMOTE).
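To illustrate how synthetic examples differ from duplicates, here is a simplified sketch of the SMOTE idea: each new point is placed on the line segment between a minority record and one of its nearest minority neighbours. It assumes purely numeric features stored as tuples, and it is not the implementation used by any particular library (packages such as imbalanced-learn provide a full one). The name `smote_sketch` and its parameters are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=None):
    """Synthesize n_new points by interpolating between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority records to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen record (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(p, base))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbour)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = smote_sketch(minority, 10, k=3, seed=0)
print(len(new_points))  # 10
```

Every synthetic point lies between two existing minority records, so it stays inside the region the minority class already occupies rather than repeating an existing record exactly.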
Get the sampling error
The margin of error is estimated as 1/√n, where n is the size of the sample, for each of the above techniques. This is a common approximation that corresponds roughly to a 95% confidence level.
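As a quick worked example of this estimate (the function name `margin_of_error` is hypothetical):

```python
import math


def margin_of_error(sample_size: int) -> float:
    """Approximate margin of error 1 / sqrt(n) for a sample of n records."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return 1 / math.sqrt(sample_size)


print(margin_of_error(400))  # 0.05
```

A sample of 400 records therefore gives a margin of error of about 5%, and quadrupling the sample size halves it.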
Getting the best Sample obtained
For each column, take as the null hypothesis that the sample and the original dataset follow the same distribution, and calculate the p-value using the Kolmogorov-Smirnov test (for continuous columns) or Pearson's chi-square test (for categorical columns). If the p-values are >= 0.05 for more than a threshold number of columns (50% is used here), the subsample is accepted. The p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is correct. It is used to decide whether there is evidence of a statistical difference between the distributions of the two populations (sample vs. original dataset): the smaller the p-value, the stronger the evidence that they differ, that is, the stronger the evidence in favor of the alternative hypothesis. Among the samples obtained above, the one with the highest average p-value is suggested as the closest to the actual dataset.
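The acceptance rule and the final choice can be sketched as below, assuming the per-column p-values have already been computed for each technique. The function names, the column names, and all p-values in the example are made up for illustration.

```python
def acceptance_summary(p_values, alpha=0.05, min_fraction=0.5):
    """Return (accepted, mean p-value) for one sample's per-column p-values."""
    passed = sum(p >= alpha for p in p_values.values())
    accepted = passed > min_fraction * len(p_values)  # more than 50% of columns
    mean_p = sum(p_values.values()) / len(p_values)
    return accepted, mean_p


def best_sample(candidates, alpha=0.05, min_fraction=0.5):
    """Among accepted samples, return the technique with the highest mean p-value."""
    scores = {}
    for technique, p_values in candidates.items():
        accepted, mean_p = acceptance_summary(p_values, alpha, min_fraction)
        if accepted:
            scores[technique] = mean_p
    return max(scores, key=scores.get) if scores else None


# Hypothetical p-values per technique and column.
candidates = {
    "random": {"age": 0.40, "income": 0.03, "city": 0.70},
    "systematic": {"age": 0.02, "income": 0.01, "city": 0.90},
    "stratified": {"age": 0.80, "income": 0.55, "city": 0.60},
}
print(best_sample(candidates))  # stratified
```

Here the systematic sample is rejected because only one of its three columns reaches 0.05, and the stratified sample wins over the random one on average p-value.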
Chi-Square Test for Homogeneity : A chi-square test for homogeneity is a statistical test that uses the properties of the chi-square probability distribution to assess whether two samples of data share a common underlying distribution. In other words, the test checks whether the observed category counts in the two samples are consistent with both coming from identically distributed populations. The null hypothesis is that the two samples under study come from populations governed by the same probability distribution, and the alternative hypothesis is that they come from populations governed by different probability distributions. The test yields a p-value, which is then interpreted as evidence against the null hypothesis or as a lack of such evidence.
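For a single categorical column, the test can be sketched with the standard library alone. This is an illustrative version rather than the code behind the ‘Sampling’ function: the names `chi2_survival` and `chi2_homogeneity` are hypothetical, both inputs are assumed to be non-empty lists of category values with at least two distinct categories overall, and the p-value uses the closed-form chi-square tail for integer degrees of freedom. In practice a library routine such as SciPy's `chi2_contingency` would normally be used instead.

```python
import math
from collections import Counter


def chi2_survival(x, df):
    """P(X >= x) for a chi-square variable with integer df >= 1."""
    z = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)             # df = 2 tail
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))  # df = 1 tail
    while a < df / 2:                        # step up two degrees of freedom at a time
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return q


def chi2_homogeneity(sample, population):
    """Return (statistic, p-value) comparing category counts of two lists."""
    counts = [Counter(sample), Counter(population)]
    categories = set(sample) | set(population)
    if len(categories) < 2:
        raise ValueError("need at least two distinct categories")
    row_totals = [len(sample), len(population)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for category in categories:
        column_total = counts[0][category] + counts[1][category]
        for row, row_total in enumerate(row_totals):
            expected = row_total * column_total / grand_total
            statistic += (counts[row][category] - expected) ** 2 / expected
    return statistic, chi2_survival(statistic, len(categories) - 1)


population = ["a"] * 60 + ["b"] * 40
matching = ["a"] * 6 + ["b"] * 4   # same 60/40 split as the population
skewed = ["a"] * 50                # no "b" records at all
print(chi2_homogeneity(matching, population))   # statistic 0.0, p-value 1.0
print(chi2_homogeneity(skewed, population)[1] < 0.05)  # True
```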
P-Value : A p-value quantifies the probability of obtaining data as extreme as (or more extreme than) the data actually collected, provided that the null hypothesis is true. If a statistical test returns a p-value of 0.07, for example, there would be a 7% chance of obtaining data at least this extreme if the null hypothesis were true, i.e. if the two samples under study came from populations governed by the same probability distribution. Hence, the higher the p-value, the more compatible the data are with the hypothesis that the two samples come from similarly distributed populations. The p-value is not, however, the probability that the null hypothesis itself is true. If a given set of data aligns with the expectations specified by the null hypothesis, there is no evidence to reject it. If the data do not seem consistent with the null hypothesis, that can be used as evidence to support the claim that the null hypothesis is false.