Key Concepts

Review core concepts you need to learn to master this subject

Datasets and their Histograms

When a dataset is plotted as a histogram, the shape of the plotted values reveals the dataset's distribution type.

The number of peaks in the histogram determines the modality of the dataset: it can be unimodal (one peak), bimodal (two peaks), multimodal (more than two peaks), or uniform (no distinct peak).

Unimodal datasets can be further classified as symmetric, skew-left, or skew-right, depending on where the peak falls relative to the rest of the data.
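
A minimal sketch that generates and plots datasets with these shapes, assuming Matplotlib is available (every distribution parameter below is an arbitrary illustration, not part of the lesson):

import numpy as np
import matplotlib.pyplot as plt

# Illustrative datasets; all parameters are arbitrary choices
unimodal = np.random.normal(loc=50, scale=10, size=10000)
bimodal = np.concatenate([np.random.normal(loc=30, scale=5, size=5000),
                          np.random.normal(loc=70, scale=5, size=5000)])
skew_right = np.random.exponential(scale=10, size=10000)

# One histogram per dataset to inspect its modality and skew
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, data, title in zip(axes, [unimodal, bimodal, skew_right],
                           ["Unimodal", "Bimodal", "Skew-right"]):
    ax.hist(data, bins=40)
    ax.set_title(title)
plt.show()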

Normal Distribution Using the Python NumPy Module

A normal distribution can be generated in NumPy with the following method:

np.random.normal(loc, scale, size)

Here, loc is the mean of the normal distribution, scale is its standard deviation, and size is the number of samples to generate.
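
For example, a minimal sketch assuming an arbitrary mean of 100, standard deviation of 15, and 10,000 samples:

import numpy as np

# Draw 10,000 samples from a normal distribution with mean 100 and
# standard deviation 15 (all three values are arbitrary illustrations)
samples = np.random.normal(loc=100, scale=15, size=10000)

print(samples.mean())   # close to 100
print(samples.std())    # close to 15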

Standard deviation

The standard deviation of a normal distribution determines how spread out the data is from the mean; the sketch after this list checks these proportions empirically:

  • Approximately 68% of samples fall within 1 standard deviation of the mean.

  • Approximately 95% of samples fall within 2 standard deviations of the mean.

  • Approximately 99.7% of samples fall within 3 standard deviations of the mean.
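
A minimal sketch of that empirical check, assuming an arbitrary mean of 50 and standard deviation of 10:

import numpy as np

# Empirical check of the percentages above; mean and standard deviation
# are arbitrary values chosen for this sketch
mean, std = 50, 10
samples = np.random.normal(loc=mean, scale=std, size=100000)

for k in (1, 2, 3):
    within = np.mean((samples > mean - k * std) & (samples < mean + k * std))
    print(f"Within {k} standard deviation(s) of the mean: {within:.1%}")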

Histogram Visualization

A histogram is a plot that visualizes the distribution of the samples in a dataset. The vertical axis shows frequency, and the horizontal axis is divided into bins, each covering a range of values with a minimum and a maximum. The height of each bar is the number of samples that fall into that bin.
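
A minimal sketch of a histogram plot, assuming Matplotlib is available; the distribution parameters and the bin count of 30 are illustrative choices:

import numpy as np
import matplotlib.pyplot as plt

# Histogram of a normally distributed dataset
data = np.random.normal(loc=0, scale=1, size=5000)

plt.hist(data, bins=30)     # 30 equal-width bins spanning the data's range
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()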

Introduction to Statistics with NumPy
Lesson 1 of 2
  1. You’re a citizen scientist who has started collecting data about rising water in the river next to where you live. For months, you painstakingly measure the water levels and enter your findings int…
  2. The first statistical concept we’ll explore is mean, also commonly referred to as an average. The mean is a useful measurement to get the center of a dataset. NumPy has a built-in function to cal…
  3. We can also use np.mean to calculate the percent of array elements that have a certain property. As we know, a logical operator will evaluate each item in an array to see if it matches the specif…
  4. If we have a two-dimensional array, np.mean can calculate the means of the larger array as well as the interior values. Let’s imagine a game of ring toss at a carnival. In this game, you have thr…
  5. As we can see, the mean is a helpful way to quickly understand different parts of our data. However, the mean is highly influenced by the specific values in our data set. What happens when one of t…
  6. One way to quickly identify outliers is by sorting our data. Once our data is sorted, we can quickly glance at the beginning or end of an array to see if some values lie far beyond the expected ran…
  7. Another key metric that we can use in data analysis is the median. The median is the middle value of a dataset that’s been ordered in terms of magnitude (from lowest to highest). Let’s look at …
  8. In a dataset, the median value can provide an important comparison to the mean. Unlike a mean, the median is not affected by outliers. This becomes important in skewed datasets, datasets whose va…
  9. As we know, the median is the middle of a dataset: it is the number for which 50% of the samples are below, and 50% of the samples are above. But what if we wanted to find a point at which 40% of t…
  10. Some percentiles have specific names: - The 25th percentile is called the first quartile - The 50th percentile is called the median - The 75th percentile is called the third quarti…
  11. While the mean and median can tell us about the center of our data, they do not reflect the range of the data. That’s where standard deviation comes in. Similar to the interquartile range, the …
  12. As we saw in the last exercise, knowing the standard deviation of a dataset can help us understand how spread out our dataset is. We can find the standard deviation of a dataset using the NumPy f…
  13. Let’s review! In this lesson, you learned how to use NumPy to analyze single-variable datasets. Here’s what we covered: - Using the np.sort method to locate outliers. - Calculating central positio…
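
A minimal sketch that pulls together the NumPy functions the exercises above mention (np.sort, np.mean, np.median, np.percentile, np.std), using an invented set of water-level readings for illustration only:

import numpy as np

# Invented water-level readings; the values are not from the lesson
water_levels = np.array([4.2, 4.5, 4.1, 4.8, 5.0, 4.4, 9.9, 4.3])

print(np.sort(water_levels))            # sorting makes the 9.9 outlier easy to spot
print(np.mean(water_levels))            # mean is pulled upward by the outlier
print(np.mean(water_levels > 4.5))      # fraction of readings above 4.5
print(np.median(water_levels))          # median is unaffected by the outlier
print(np.percentile(water_levels, 25))  # first quartile
print(np.percentile(water_levels, 75))  # third quartile
print(np.std(water_levels))             # standard deviation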
Lesson 2 of 2
  1. A university wants to keep track of the popularity of different programs over time, to ensure that programs are allocated enough space and resources. You work in the admissions office and are asked…
  2. When we first look at a dataset, we want to be able to quickly understand certain things about it: - Do some values occur more often than others? - What is the range of the dataset (i.e., the min …
  3. Suppose we had a larger dataset with values ranging from 0 to 50. We might not want to know exactly how many 0’s, 1’s, 2’s, etc. we have. Instead, we might want to know how many values fall betwee…
  4. We can graph histograms using a Python module known as Matplotlib. We’re not going to go into detail about Matplotlib’s plotting functions, but if you’re interested in learning more, take our co…
  5. Histograms and their datasets can be classified based on the shape of the graphed values. In the next two exercises, we’ll look at two different ways of describing histograms. One way to classify…
  6. Most of the datasets that we’ll be dealing with will be unimodal (one peak). We can further classify unimodal distributions by describing where most of the numbers are relative to the peak. A sym…
  7. The most common distribution in statistics is known as the normal distribution, which is a symmetric, unimodal distribution. Lots of things follow a normal distribution: - The heights of a large…
  8. We can generate our own normally distributed datasets using NumPy. Using these datasets can help us better understand the properties and behavior of different distributions. We can also use them to…
  9. In a normal distribution, we know that the mean and the standard deviation determine certain characteristics of the shape of our data, but how exactly? Let’s do some exploration to find out!
  10. We know that the standard deviation affects the “shape” of our normal distribution. The last exercise helps to give us a more quantitative understanding of this. Suppose that we have a normal dis…
  11. It’s known that a certain basketball player makes 30% of his free throws. On Friday night’s game, he had the chance to shoot 10 free throws. How many free throws might you expect him to make? We …
  12. There are some complicated formulas for determining these types of probabilities. Luckily for us, we can use NumPy - specifically, its ability to generate random numbers. We can use these random nu…
  13. Let’s return to our original question: Our basketball player has a 30% chance of making any individual basket. He took 10 shots and made 4 of them, even though we only expected him to make 3. Wh…
  14. Let’s review! In this lesson, you learned how to use NumPy to analyze different distributions and generate random numbers to produce datasets. Here’s what we covered: - What is a histogram and how…
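
The free-throw experiment described in the exercises above can be simulated with np.random.binomial; a minimal sketch, where the count of 10,000 simulated games is an arbitrary choice:

import numpy as np

# Simulate 10,000 games of 10 free throws, each with a 30% chance of success
makes = np.random.binomial(n=10, p=0.30, size=10000)

print(makes.mean())          # average makes per game, close to the expected 3
print(np.mean(makes == 4))   # estimated probability of making exactly 4
print(np.mean(makes >= 4))   # estimated probability of making 4 or more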

What you'll create

Portfolio projects that showcase your new skills


How you'll master it

Stress-test your knowledge with quizzes that help commit syntax to memory
