Simulations are a powerful technique for solving problems. Because it is often difficult to conduct several large-scale experiments across an entire population, samples are instead drawn from a representative probability distribution.

A probability distribution is a mathematical function that gives probabilities of occurrence of different possible outcomes for an experiment.

A probability distribution for a coin toss takes on 2 values, which can be represented by the numbers 0 (heads) and 1 (tails). A probability distribution for a dice roll, on the other hand, takes on 6 values.

Discrete Random Variables (DRV)

So far, the two scenarios we have discussed involve what can be described as discrete variables. A discrete variable is a variable that can only take on specific values, such as the integers 0 and 1.

For a discrete random variable, the probability density function (PDF) describes the probability of each outcome occurring. For instance, let us create a probability table for a dice roll, where x is the outcome of the roll.

x | 1 | 2 | 3 | 4 | 5 | 6
\Pr(X = x) or PDF | \frac{1}{6} | \frac{1}{6} | \frac{1}{6} | \frac{1}{6} | \frac{1}{6} | \frac{1}{6}

Additionally, the cumulative distribution function (CDF) for a discrete random variable measures the probability that the variable takes on the chosen value or any value less than it. Adding on to the table earlier, we get:

x | 1 | 2 | 3 | 4 | 5 | 6
\Pr(X \leq x) or CDF | \frac{1}{6} | \frac{2}{6} | \frac{3}{6} | \frac{4}{6} | \frac{5}{6} | \frac{6}{6}
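As a minimal sketch of these two tables in code (assuming nothing more than a fair six-sided dice), the PDF and CDF can be tabulated as follows:

```kotlin
// A minimal sketch: tabulating the PDF and CDF of a fair six-sided dice roll
fun main() {
    val outcomes = 1..6
    val pdf = outcomes.associateWith { 1.0 / 6.0 }      // Pr(X = x) for each face
    var running = 0.0
    val cdf = outcomes.associateWith { x ->             // Pr(X <= x) as a running sum of the PDF
        running += pdf.getValue(x)
        running
    }
    for (x in outcomes) {
        println("x=$x  PDF=${"%.3f".format(pdf.getValue(x))}  CDF=${"%.3f".format(cdf.getValue(x))}")
    }
}
```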

Continuous Random Variables (CRV)

Another kind of variable, known as a continuous random variable, can take on infinitely many values between a specified minimum and maximum. For instance, the time taken to travel from one place to another is a continuous variable, as it can take on infinitely many possible durations (seconds, milliseconds, and so on).

For any continuous random variable, which can take on an infinite range of values, the probability that a generated random number equals any one specific value is 0.¹ Hence, the probability density function (PDF) for a continuous random variable describes how likely values in a continuous set are, and is used to compute the probability of the random variable falling within a particular range of values in the sample distribution.

This is shown as the area under the continuous variable’s probability density function, as follows:

\Pr(a\leq X \leq b) = \int_{a}^{b}{f_X(x)\,dx}

where

  • \int_{a}^{b} is the integral, i.e. the area under the curve from x=a to x=b
  • dx indicates that we integrate with respect to x
  • f_X(x) is the probability density function

1  Formally, \Pr(X = x) = 0 for every individual value x of a continuous random variable; only ranges of values carry non-zero probability.
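To make the integral concrete, here is a small numerical sketch. It assumes, purely for illustration, a standard normal distribution for f_X and approximates the area with the trapezoidal rule:

```kotlin
import kotlin.math.PI
import kotlin.math.exp
import kotlin.math.sqrt

// PDF of a standard normal distribution, used here purely as an example f_X(x)
fun normalPdf(x: Double): Double = exp(-x * x / 2.0) / sqrt(2.0 * PI)

// Approximate Pr(a <= X <= b), the integral of f_X(x) dx from a to b, with the trapezoidal rule
fun probBetween(a: Double, b: Double, steps: Int = 10_000): Double {
    val h = (b - a) / steps
    var area = 0.5 * (normalPdf(a) + normalPdf(b))
    for (i in 1 until steps) area += normalPdf(a + i * h)
    return area * h
}

fun main() {
    // Area within one standard deviation of the mean; prints roughly 0.6827
    println(probBetween(-1.0, 1.0))
}
```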

Meanwhile, the cumulative distribution function (CDF) of a continuous random variable is derived from its probability density function (as shown above). The cumulative distribution function F_X(x) is defined for any random variable X as

F_X(x) = \Pr(X\leq x)

which is the probability that the random variable X takes on a value less than or equal to x.

How are the PDF and the CDF related for DRV and CRV?

For a CRV
If we combine the integral expression for the PDF with the definition of the CDF above, we get:

\Pr(a\le X \le b) = \Pr(X\le b)-\Pr(X\le a) = F_X(b)-F_X(a)

This means that, for a continuous random variable, the definite integral of the probability density function from a to b is the same as the difference between the cumulative distribution function evaluated at b and at a.
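As a quick worked example, assume (purely for illustration) a random variable distributed uniformly over [0, 6], so that F_X(x) = \frac{x}{6} on that interval. Then:

\Pr(2 \leq X \leq 4) = F_X(4) - F_X(2) = \frac{4}{6} - \frac{2}{6} = \frac{1}{3}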

A visual way to understand this is to look at the probability density function (PDF) of a continuous random variable. If we take the total area under the graph, it adds up to 100%, or 1.

On the other hand, the cumulative distribution function gives the area under the probability density function (the curve) from negative infinity up to x, the chosen value.

Often in statistics, we want to quickly query areas under the graph of a continuous random variable, such as the area within 1 standard deviation of the mean. Instead of computing the CDF at individual points each time, we work with the PDF directly; thus, the PDF tends to be used more frequently.

The linear problem: A Thought Experiment 🤔

Sometimes, not all random values have an equal chance of being chosen. Imagine, for instance, that we were attempting to simulate the potential of a Covid-19 variant to spread through a city's population.

Modelling the complex daily interactions requires random numbers that simulate the timings and places where individuals conduct essential or non-essential activities, as well as the duration of these interactions (e.g. "longer interactions" from meeting at food centres). Random numbers also need to be generated to estimate the duration of meetings and the number of people at these locations. Typical household population sizes will likely differ from those of workplaces, and certainly from more densely populated areas such as worker dormitories. Adding the effect of pandemic public health measures, such as social distancing and mask-wearing, further adjusts the probabilities. In essence, simulating social networks means taking into account a myriad of factors with different probabilities, all of which affect the random numbers generated as time progresses in the simulation.²

One method of generating random samples that reflect the underlying distribution is inverse transform sampling (ITS). ITS is a method for generating samples at random from any probability distribution, given its cumulative distribution function.
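As a minimal sketch of the idea, assume (for illustration only) an exponential distribution, whose CDF F(x) = 1 - e^{-\lambda x} has the closed-form inverse F^{-1}(u) = -\frac{\ln(1-u)}{\lambda}:

```kotlin
import kotlin.math.ln
import kotlin.random.Random

// Inverse transform sampling sketch for an exponential distribution (illustrative choice):
// draw u ~ Uniform(0, 1) and push it through the inverse CDF, x = -ln(1 - u) / lambda.
fun sampleExponential(lambda: Double): Double {
    val u = Random.nextDouble()           // uniform random number in [0, 1)
    return -ln(1.0 - u) / lambda
}

fun main() {
    val lambda = 0.5                      // assumed rate parameter, for illustration only
    val samples = List(100_000) { sampleExponential(lambda) }
    println("Sample mean: ${samples.average()} (theoretical mean: ${1.0 / lambda})")
}
```

Because uniform random numbers are pushed through the inverse CDF, the resulting samples follow the chosen distribution; the same idea underpins the Poisson example below.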

2 For those interested, a simulation has been conducted on the spread of Covid-19 in Singapore and is detailed in the Nature journal. It illustrates some of the various ways random numbers are generated and used in the simulation.

How ITS works

https://www.desmos.com/calculator/hrsfj8nohi

Continuing from our earlier example of simulating the spread of Covid-19, let's model the number of people who visit the MRT on a daily basis. We can model this daily number using a Poisson distribution. The Poisson distribution expresses the probability of a given number of events occurring in a fixed duration if (1) these events occur at a known constant rate (e.g. frequency) and (2) each event is independent of the previous one.

The Poisson distribution is given by:

\Pr(X{=}k)= \frac{\lambda^k e^{-\lambda}}{k!}

where

  • k is the number of occurrences (a non-negative integer)
  • \lambda is the expected (average) number of occurrences in the given duration

Next, we plot the cumulative distribution function (CDF) of this distribution, given by:

e^{-\lambda} \sum_{i=0}^{\lfloor k\rfloor} \frac{\lambda^i}{i!}

where
  • \lfloor k\rfloor is the floor function of k
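To see what this staircase looks like numerically, here is a small Kotlin sketch of the Poisson PMF and CDF. The rate \lambda = 3 is an assumed value chosen only for illustration:

```kotlin
import kotlin.math.exp

// Poisson PMF, Pr(X = k) = lambda^k * e^(-lambda) / k!, computed iteratively to avoid overflow
fun poissonPmf(k: Int, lambda: Double): Double {
    var p = exp(-lambda)
    for (i in 1..k) p *= lambda / i
    return p
}

// Poisson CDF, Pr(X <= k), as the running sum of the PMF
fun poissonCdf(k: Int, lambda: Double): Double =
    (0..k).sumOf { poissonPmf(it, lambda) }

fun main() {
    val lambda = 3.0   // assumed average daily rate, for illustration only
    for (k in 0..8) {
        println("k=$k  PMF=${"%.4f".format(poissonPmf(k, lambda))}  CDF=${"%.4f".format(poissonCdf(k, lambda))}")
    }
}
```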

The CDF is not a smooth curve, as the Poisson distribution takes on only integer values. However, we can still observe that if we were to draw a horizontal line across at a high probability value, the chance of landing on a specific number greater than 7 is incredibly slim.

In order to make the probabilities the independent variable, we have to "swap" X and Y by taking the inverse of the CDF. To see how the distribution is reflected in the creation of random numbers, we can generate random numbers from this distribution using Kotlin, as sketched below.
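Here is one possible Kotlin sketch of inverse transform sampling for the Poisson distribution (the rate \lambda = 3 is assumed for illustration): draw u uniformly from [0, 1), then walk up the CDF until it first exceeds u and return that k.

```kotlin
import kotlin.math.exp
import kotlin.random.Random

// Inverse transform sampling for the Poisson distribution:
// draw u ~ Uniform(0, 1), then accumulate the CDF until it first exceeds u and return that k.
fun samplePoisson(lambda: Double): Int {
    val u = Random.nextDouble()
    var k = 0
    var pmf = exp(-lambda)       // Pr(X = 0)
    var cdf = pmf
    while (cdf < u) {
        k++
        pmf *= lambda / k        // Pr(X = k) derived from Pr(X = k - 1)
        cdf += pmf
    }
    return k
}

fun main() {
    val lambda = 3.0             // assumed daily rate, matching the earlier illustration
    val samples = List(100_000) { samplePoisson(lambda) }
    println("Sample mean: ${samples.average()} (expected: $lambda)")
}
```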

However, notice that if we reduce the value of lambda, the values become more clustered. To get an accurate average, we are forced to take more samples, which takes more time (it is computationally intensive).

Other methods: The Acceptance Rejection Sampling method

This method comes from a broader class of methods grouped under Monte Carlo simulations.

Imagine wanting to estimate the value of \pi. We take a perfectly circular dish with radius X cm and a square dish with sides of length X cm. We lay these two dishes a small distance apart and begin randomly and uniformly depositing marbles of similar mass over the area containing both dishes. By the end of the experiment, we can take the value of

\frac{\textrm{mass of circular dish}}{\textrm{mass of square dish}}\approx\frac{\pi X^2}{X^2}\approx\pi

This example can be seen at https://www.youtube.com/watch?v=7ESK5SaP-bc
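A common digital variant of this experiment (a sketch, not taken from the video) scatters random points uniformly in a unit square and counts how many land inside the quarter circle of radius 1; the accepted fraction approximates \pi/4:

```kotlin
import kotlin.random.Random

// Monte Carlo estimate of pi: scatter random points in the unit square and count
// how many land inside the quarter circle of radius 1.
fun estimatePi(samples: Int): Double {
    var inside = 0
    repeat(samples) {
        val x = Random.nextDouble()
        val y = Random.nextDouble()
        if (x * x + y * y <= 1.0) inside++
    }
    return 4.0 * inside / samples   // area ratio (pi/4) scaled back up to pi
}

fun main() {
    println(estimatePi(1_000_000))  // prints a value close to 3.14159
}
```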

Similarly, the Acceptance Rejection Sampling Method replaces the dishes with the probability density function of a variable and samples uniformly within a bounding box that covers the graph. It rejects values that fall outside the distribution and returns the x-values that fall under the curve.
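A minimal sketch of the method, assuming an illustrative target density f(x) = 3x^2 on [0, 1] with a uniform proposal (the density and its bound are chosen here purely for demonstration):

```kotlin
import kotlin.random.Random

// Acceptance-rejection sketch for an assumed target PDF f(x) = 3x^2 on [0, 1].
// Candidate points are sampled uniformly in the bounding box; only points under the PDF are kept.
fun targetPdf(x: Double): Double = 3.0 * x * x     // illustrative density, maximum value 3 at x = 1

fun sampleByRejection(fMax: Double = 3.0): Double {
    while (true) {
        val x = Random.nextDouble()                // candidate from the uniform proposal on [0, 1)
        val y = Random.nextDouble() * fMax         // uniform height within the bounding box
        if (y <= targetPdf(x)) return x            // accept only points under the curve
    }
}

fun main() {
    val samples = List(100_000) { sampleByRejection() }
    // The theoretical mean of f(x) = 3x^2 on [0, 1] is 3/4
    println("Sample mean: ${samples.average()} (expected: 0.75)")
}
```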

Issues:
Intuitively, one realises that if the distribution is highly concentrated in a small area, a large proportion of the sampled points will fall outside it and be rejected, wasting computation.