Sampling: Different Methods and Their Implementation

Discussing sample size calculation, different sampling techniques, and associated code snippets in Python to perform sampling in large datasets

Buzzwords in this article:

  • Population: It includes all the elements from the data set.

  • Sample: One or more observations drawn from the population. A measurable characteristic of a sample is called a statistic.

  • Sample size: The number of observations from the population that are included in a survey or experiment.

  • Sampling Bias: A distortion that arises when samples are selected for non-random reasons, so that they do not represent the true distribution of the population.

The sample size is normally smaller than the population size.

  • The process of selecting a sample is called sampling.


Why and when to do sampling?

Sampling lets us gather information about an entire population without examining every member. Results computed from a sample will not match the population values exactly, but with a well-chosen sample they come close. A sample should be representative of the population and not biased in any manner, just as elected MPs and MLAs are meant to represent the population of a region.
Samples are used when the population is large or scattered, or when it is hard to collect data on every individual within it. You can then use a small sample to generalize about the characteristics of the whole dataset.
Samples should be randomly selected and should represent every class within the population. Sampling also saves time and cost, and hence forms the basis of most research designs.

How to calculate sample size?

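Choosing a sample size can be one of the most challenging tasks in statistics and depends on many factors, including the size of the original population, the desired confidence level, and the acceptable margin of error. Sample size formulas often also need the population standard deviation, which is usually unknown; the standard deviation of a sample (for example, from a pilot study) can be used to approximate it.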

Different methods to calculate sample size:

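  • Use a sample size from a similar study : If your type of study has already been undertaken by someone else, you can reuse its sample size. You’ll need access to academic databases to search for such a study.
  • Use a sample size calculator : Various calculators are available online, some simple, some more complex and specialized, for example for group- or cluster-randomized trials (GRTs).
  • Cochran's formula : Computes the sample size from the desired confidence level, the acceptable margin of error, and an estimate of the proportion of the population that has the attribute of interest.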
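
As a rough illustration of Cochran's formula, the sketch below computes n0 = z² · p · (1 − p) / e², where z is the z-score for the chosen confidence level, p is the estimated proportion, and e is the margin of error. For a small population of size N it applies the usual finite population correction n = n0 / (1 + (n0 − 1) / N). The function name and the default values (95% confidence, 5% margin of error, p = 0.5) are assumptions chosen for this example, not values prescribed by the article.

```python
import math
from statistics import NormalDist


def cochran_sample_size(confidence=0.95, margin_of_error=0.05, p=0.5, population=None):
    """Cochran's sample size, optionally corrected for a finite population."""
    # z-score for a two-sided confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # infinite-population form: n0 = z^2 * p * (1 - p) / e^2
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    if population is not None:
        # finite population correction: n = n0 / (1 + (n0 - 1) / N)
        n0 = n0 / (1 + (n0 - 1) / population)
    # round up, because a sample cannot contain a fraction of a member
    return math.ceil(n0)


print(cochran_sample_size())                 # very large population
print(cochran_sample_size(population=5500))  # population of 5500 members
```

Using p = 0.5 is the conservative choice when the true proportion is unknown, because it maximizes p · (1 − p) and therefore the required sample size.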

Types of Sampling:

There are two broad types of sampling:

  • Probability Sampling: A sampling technique in which the researcher chooses samples from a larger population using a method based on the theory of probability. For a participant to be part of a probability sample, he/she must be chosen by random selection.
  • Non-probability Sampling: In non-probability sampling, the researcher chooses members based on subjective judgment or convenience rather than at random. There is no fixed or predefined selection process, so the elements of the population do not all have an equal opportunity to be included in the sample.

Probability Sampling:

Probability sampling considers every member of the population and forms samples based on a fixed process.
For example, in a simple random draw from a population of 5500 members, every member has a 1/5500 chance of being picked on a single draw. Probability sampling reduces selection bias and gives all members a fair chance to be included in the sample.

There are four techniques in Probability Sampling:

  • Simple random sampling:

    Simple Random Sampling is one of the most straightforward probability sampling techniques and helps save time and resources. It is a reliable method of obtaining information in which every member of the sample is chosen randomly, merely by chance. Each individual has the same probability of being chosen to be a part of the sample.
    For example, in a college with 250 students in a particular branch, if the Head of Department decides to conduct team building activities for personality development, they would most likely pick students at random and group them. In this case, each of the 250 students has an equal opportunity of being selected.

Below is a simple Python snippet to perform Simple Random Sampling on a pandas DataFrame df:

# draw 250 rows uniformly at random, without replacement
sample_df = df.sample(n=250)
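
To show what such a draw does without relying on pandas, here is a minimal sketch using only Python's standard library. The list of student IDs, the seed, and the sample size of 25 are made up for illustration.

```python
import random

# hypothetical population: student IDs 1..250
students = list(range(1, 251))

rng = random.Random(42)  # fixed seed so the draw is reproducible
# draw 25 distinct students; every student has the same chance of selection
simple_random_sample = rng.sample(students, k=25)

print(sorted(simple_random_sample))
```

random.sample draws without replacement, so no student can appear twice in the sample.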
  • Cluster Sampling :

    Cluster sampling is a type of probability sampling in which, instead of sampling individual members directly, we use subsets of the population as the sampling units. The population is divided into subsets or subgroups called clusters, some of the clusters are selected at random, and the members of the selected clusters form the sample.

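import numpy as np

# randomly choose 4 cluster IDs out of 10 (IDs 1 to 10), without replacement
clusters = np.random.choice(np.arange(1, 11), size=4, replace=False)
# 'columnname' stands for the column that holds each row's cluster ID;
# the sample is every row that belongs to one of the 4 chosen clusters
cluster_sample = df[df['columnname'].isin(clusters)]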
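
The two stages of cluster sampling (pick clusters, then keep all their members) can also be sketched with the standard library alone. The population below, with 100 members spread evenly over 10 clusters, is hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed for reproducibility

# hypothetical population: 100 members, each assigned to one of 10 clusters (1..10)
population = [{"member_id": i, "cluster": i % 10 + 1} for i in range(100)]

# stage 1: randomly choose 4 of the 10 clusters
chosen_clusters = set(rng.sample(range(1, 11), k=4))

# stage 2: keep every member that belongs to a chosen cluster
cluster_sample = [m for m in population if m["cluster"] in chosen_clusters]

print(sorted(chosen_clusters), len(cluster_sample))
```

Note that randomness enters only in stage 1; once a cluster is chosen, all of its members are included.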
  • Systematic Sampling :

    Systematic sampling is a probability sampling method in which one chooses elements from a target population by selecting a random starting point and then picking sample members at a fixed sampling interval. For example, with a population of size N = 2500 and a sample of size n = 500, the sampling interval is 5.

Systematic Sampling Formula for interval (i) = N/n = 2500/500 = 5

In this case 1 in every 5 individuals is selected. Randomly choose the starting member (r) from the first i members, then keep adding the interval (here 5) to reach the next member of the sample. The elements r, r+i, r+2i, etc. form the sample.

We can achieve this in Python by:

start = np.random.randint(5)  # random starting row r in [0, 5)
sys_sample_df = df.iloc[start::5]  # rows r, r+5, r+10, ...
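
The r, r+i, r+2i pattern can be made explicit in a small standard-library sketch. The function name systematic_sample and the integer population are assumptions for illustration; the sizes N = 2500 and n = 500 follow the example above.

```python
import random


def systematic_sample(items, sample_size, rng=random):
    """Pick every i-th item after a random start, with i = N // n."""
    interval = len(items) // sample_size   # sampling interval i = N / n
    start = rng.randrange(interval)        # random start r in [0, i)
    # take r, r+i, r+2i, ... and cap the result at the requested size
    return items[start::interval][:sample_size]


population = list(range(2500))  # hypothetical population of N = 2500 members
sample = systematic_sample(population, 500, random.Random(1))

print(len(sample), sample[:5])
```

One caveat with this technique: if the data has a periodic pattern whose period matches the interval, the sample can be biased, so it is worth checking the ordering of the data first.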
  • Stratified Sampling :

    Stratified sampling is a probability sampling technique in which the entire population is divided into multiple non-overlapping, homogeneous groups known as strata, and the final members of the sample are randomly chosen from each stratum. Each member of the population should belong to exactly one stratum, so that every member gets an opportunity to be selected using simple probability. This sampling method is also called “random quota sampling”.

Well, doesn't this seem like Cluster Sampling? The two are similar; the key difference is that in Cluster Sampling the sample is formed by randomly selecting whole clusters and including all the members of those clusters, whereas in Stratified Sampling the sample is formed by randomly selecting members from every stratum.

The accuracy of statistical results from Stratified Sampling is often higher than from Simple Random Sampling, since every relevant stratum is represented in the sample. The variation within each stratum is typically much smaller than the variation in the target population as a whole.

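We can implement this type of sampling using the readily available StratifiedShuffleSplit class of scikit-learn in Python:

from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
_, sample_idx = next(split.split(df, df['columnname']))  # "test" indices = 20% stratified sample
strat_sample_df = df.iloc[sample_idx]  # 'columnname' stands for the column that defines the strata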
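
For comparison, here is a dependency-free sketch of proportional stratified sampling. It is not how scikit-learn implements its splitter; it simply groups records by stratum and draws the same fraction from each group. The function name, the "grade" field, and the 70/20/10 split of the data are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, rng):
    """Draw the same fraction of records from every stratum."""
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)  # group records by stratum
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)  # proportional allocation
        sample.extend(rng.sample(members, k))
    return sample


# hypothetical data: 700 records in stratum A, 200 in B, 100 in C
records = (
    [{"id": i, "grade": "A"} for i in range(700)]
    + [{"id": i, "grade": "B"} for i in range(700, 900)]
    + [{"id": i, "grade": "C"} for i in range(900, 1000)]
)

sample = stratified_sample(records, "grade", 0.1, random.Random(42))
counts = {g: sum(r["grade"] == g for r in sample) for g in "ABC"}
print(counts)
```

Because each stratum contributes in proportion to its size, the sample keeps the 70/20/10 mix of the population, which a simple random sample only achieves on average.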

Non-Probability sampling:

Non-probability sampling is defined as a sampling technique in which one selects the samples based on subjective judgment rather than random selection.
Unlike probability sampling and its methods, non-probability sampling does not focus on accurately representing all members of a large population within a smaller sample group of participants. As a result, not all members of the population have an equal chance of participating in the study.

A non-probability sample is selected based on non-random criteria. Non-probability sampling often results in biased samples because some members of the population are more likely to be included than others.

Types of Non-Probability Sampling:

  • Convenience sampling:

    Convenience sampling is a common type of Non-Probability Sampling in which you choose participants for a sample based on how convenient and available they are. It is used when there are time, cost, or other resource limitations in collecting feedback, for example in the initial stages of research.
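
A small sketch can make the risk of convenience sampling concrete. Assume, purely for illustration, a dataset of ages that happens to be stored in sorted order; taking the first rows is convenient but badly unrepresentative, while a simple random sample of the same size is not.

```python
import random
import statistics

# hypothetical population: ages 18..77, ten people per age, stored sorted by age
ages = sorted(age for age in range(18, 78) for _ in range(10))  # 600 people

convenience_sample = ages[:60]                     # simply the first 60 records
random_sample = random.Random(7).sample(ages, 60)  # simple random sample for comparison

print(statistics.mean(ages))                # population mean
print(statistics.mean(convenience_sample))  # mean of the youngest 60 people only
print(statistics.mean(random_sample))       # mean of the random sample
```

The convenience sample contains only the youngest members, so its mean falls far below the population mean.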

  • Judgmental or Purposive Sampling:

    Judgmental Sampling is a type of Non-Probability Sampling in which you make a conscious decision on what the sample needs to include and choose participants accordingly. You use your understanding of the research’s purpose and your knowledge of the population to judge what the sample needs to include to attain the required outcomes. You must, however, verify that each prospective sample member fits the criteria you’re after.
    There are obvious bias issues with this type of sample selection, though you have all the freedom to create the sample to fit the needs of your research. Researchers consider only the purpose of the study, along with their understanding of the target audience.
    For instance, suppose researchers want to understand the thought process of people interested in buying e-sport accessories. The selection criterion will be: “Are you interested in buying a joystick to enhance the e-sport experience?” and those who respond with a “No” or "Not Interested" are excluded from the sample.

  • Quota sampling:

    Quota sampling is a Non-Probability Sampling technique similar to Stratified Sampling. In this method, the population is split into segments (strata) and you have to fill a quota based on people who match the characteristics of each stratum.
    The selection of members in this sampling technique happens based on a pre-set standard. Because the sample is formed based on specific attributes, it will reflect the population in those attributes, although not necessarily in others. It is a rapid method of collecting samples.
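
The quota-filling process described above can be sketched as follows. The function name, the "age_group" field, and the quotas are hypothetical. The point to notice is that respondents are accepted in arrival order rather than at random, which is what separates quota sampling from stratified sampling.

```python
def quota_sample(respondents, key, quotas):
    """Accept respondents in arrival order until every segment's quota is full."""
    remaining = dict(quotas)  # copy so the caller's quotas are not modified
    sample = []
    for person in respondents:  # taken in arrival order, not at random
        segment = person[key]
        if remaining.get(segment, 0) > 0:
            sample.append(person)
            remaining[segment] -= 1
        if not any(remaining.values()):
            break  # all quotas filled
    return sample


# hypothetical respondents arriving one by one, cycling through three age groups
groups = ("18-30", "31-50", "51+")
respondents = [{"id": i, "age_group": groups[i % 3]} for i in range(30)]

sample = quota_sample(respondents, "age_group", {"18-30": 3, "31-50": 2, "51+": 1})
print([p["id"] for p in sample])
```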

  • Snowball sampling:

    Snowball sampling is a Non-Probability Sampling type that mimics a pyramid system in its selection pattern. You choose early sample participants, who then go on to recruit further sample participants until the sample size has been reached. It is applied when the subjects are difficult to trace. This ongoing pattern can be described by a snowball rolling downhill: increasing in size as it collects more snow (in this case, participants).

    The recruitment structure resembles that of a multi-level marketing (MLM) pyramid.

With this model, you are relying on who your initial sample members know to fulfill your ideal sample size. This can be quick to do when the chain of members develops past the first few levels. However, it does rely on the first members referring the research work to others.
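
The referral chain can be modelled, as a rough sketch, by a breadth-first walk over a "who refers whom" mapping. The names and the referral data below are invented, and a real study would of course collect referrals from people rather than read them from a dictionary.

```python
from collections import deque


def snowball_sample(referrals, seeds, target_size):
    """Grow a sample from the seed participants by following referrals level by level."""
    sample = []
    seen = set(seeds)      # everyone already recruited or queued
    queue = deque(seeds)   # participants waiting to be interviewed
    while queue and len(sample) < target_size:
        person = queue.popleft()
        sample.append(person)
        for contact in referrals.get(person, []):
            if contact not in seen:  # avoid recruiting the same person twice
                seen.add(contact)
                queue.append(contact)
    return sample


# hypothetical referral data: each participant names the people they can recruit
referrals = {
    "ana": ["bo", "cy"],
    "bo": ["dee"],
    "cy": ["dee", "eli"],
    "dee": [],
    "eli": ["fay"],
}

sample = snowball_sample(referrals, seeds=["ana"], target_size=5)
print(sample)
```

If the seed participants refer nobody, the queue empties and the sample stays smaller than the target, which mirrors the dependence on the first members noted above.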

Conclusion:

For any research, it is essential to choose a sampling method that meets the goals of your study. The effectiveness of your sampling relies on various factors. Sampling can sometimes be more accurate than studying an entire population, because it affords researchers more control over the subjects. Statistical manipulations are much easier with smaller data sets, and it is easier to avoid human error when inputting and analyzing the data. However, it is an important task for every researcher to select samples with minimal bias so that they accurately represent the population.