## Key Concepts

Review core concepts you need to learn to master this subject

### NumPy’s Mean and Axis

We will use the following 2-dimensional array for this example:

```py
import numpy as np

ring_toss = np.array([[1, 0, 0],
                      [0, 0, 1],
                      [1, 0, 1]])
```

The code below will calculate the average of each row.

```py
np.mean(ring_toss, axis=1)
# Output: array([ 0.33333333, 0.33333333, 0.66666667])
```

In a two-dimensional array, you may want the mean of each row or of each column rather than of the whole array. The NumPy `np.mean()` function computes these through its `axis` parameter: set `axis=1` to get the mean of each row, and `axis=0` to get the mean of each column. With no `axis` argument, it returns the mean of all elements.
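As a small sketch of the other two cases, the same `ring_toss` array can be averaged by column and as a whole. The variable names `column_means` and `overall_mean` are chosen here for illustration only.

```py
import numpy as np

ring_toss = np.array([[1, 0, 0],
                      [0, 0, 1],
                      [1, 0, 1]])

# axis=0 collapses the rows, leaving one mean per column
column_means = np.mean(ring_toss, axis=0)

# no axis argument: a single mean over all nine elements
overall_mean = np.mean(ring_toss)

print(column_means)
print(overall_mean)
```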

### Conditions in `numpy.mean()`

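In Python, the function `numpy.mean()` can be used to calculate the proportion of array elements that satisfy a certain condition. A comparison such as `arr > 5` produces a Boolean array, and because `True` counts as 1 and `False` as 0, the mean of that array is the fraction of elements for which the condition holds. Multiply by 100 to express it as a percent.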
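A minimal sketch of this idea, using a made-up array of test scores (the data and the passing mark of 70 are assumptions for illustration):

```py
import numpy as np

# Hypothetical test scores
scores = np.array([55, 72, 90, 64, 88, 41, 79, 95])

# The comparison yields a Boolean array: True where the condition holds
passed = scores >= 70

# True counts as 1 and False as 0, so the mean is the fraction that passed
fraction_passed = np.mean(passed)
percent_passed = 100 * fraction_passed

print(fraction_passed)  # 0.625
print(percent_passed)   # 62.5
```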

### NumPy Percentile Function

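In Python, the NumPy `np.percentile()` function accepts a NumPy array and a percentile value between 0 and 100, and returns the value at that percentile. By default, when the requested percentile falls between two elements, NumPy interpolates linearly between them, so the result is not always an element of the array.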
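The sketch below shows both situations with small made-up arrays: one where the percentile lands exactly on an element, and one where NumPy's default behavior interpolates between two neighbors.

```py
import numpy as np

# Hypothetical data: eleven evenly spaced values
d = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

# The 40th percentile lands exactly on an element of d
p40 = np.percentile(d, 40)
print(p40)  # 5.0

# With four values, the 50th percentile falls between 2 and 3,
# so the default linear interpolation returns a value not in the array
p50_small = np.percentile(np.array([1, 2, 3, 4]), 50)
print(p50_small)  # 2.5
```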

### NumPy’s Percentile and Quartiles

In Python, the NumPy `np.percentile()` function can calculate the first, second, and third quartiles of an array. These three quartiles are simply the values at the 25th, 50th, and 75th percentiles, so pass 25, 50, or 75 as the percentile argument, just as with any other percentile.
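As an illustration on made-up data, all three quartiles can be requested in one call by passing a list of percentiles. The interquartile range computed at the end is an extra step added here for context.

```py
import numpy as np

# Hypothetical dataset (unsorted on purpose; np.percentile handles ordering)
d = np.array([8, 3, 15, 6, 10, 1, 12, 4, 20])

# A list of percentiles returns an array with one value per percentile
first_quartile, median, third_quartile = np.percentile(d, [25, 50, 75])

# Interquartile range: the spread of the middle 50% of the data
iqr = third_quartile - first_quartile

print(first_quartile, median, third_quartile)  # 4.0 8.0 12.0
print(iqr)                                     # 8.0
```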

### NumPy’s Sort Function

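In Python, the NumPy `np.sort()` function takes a NumPy array and returns a new NumPy array containing the same numbers in ascending order; the original array is left unchanged. The array method `arr.sort()`, by contrast, sorts the array in place and returns `None`.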
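The difference between the two forms can be seen in a short sketch with made-up values:

```py
import numpy as np

# Hypothetical measurements
heights = np.array([5.8, 5.1, 6.9, 5.5, 6.0])

# np.sort returns a sorted copy and leaves `heights` untouched
sorted_heights = np.sort(heights)
print(sorted_heights)  # ascending order
print(heights)         # original order preserved

# The ndarray method sorts in place instead of returning a new array
in_place = heights.copy()
in_place.sort()
print(in_place)
```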

### Definition of Percentile

In statistics, the Nth percentile of a data set is the cutoff value below which N% of the samples fall.
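One way to check this definition, sketched here on a made-up dataset of the integers 1 through 100, is to compute a percentile and then measure what fraction of the samples lies below it:

```py
import numpy as np

# Hypothetical dataset: the integers 1 through 100
d = np.arange(1, 101)

# Cutoff value for the 30th percentile
cutoff = np.percentile(d, 30)

# Fraction of samples strictly below the cutoff
fraction_below = np.mean(d < cutoff)

print(fraction_below)  # 0.3
```

On small or heavily repeated datasets the fraction will only approximate N%, because the cutoff is interpolated between neighboring values.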

### Datasets and their Histograms

When a dataset is plotted as a histogram, the shape of the plot shows how the data is distributed, and that shape is used to classify the distribution.

The number of peaks in the histogram determines the modality of the dataset. It can be unimodal (one peak), bimodal (two peaks), multimodal (more than two peaks) or uniform (no peaks).

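Unimodal datasets can also be symmetric, skew-left, or skew-right, depending on where the peak is relative to the rest of the data. A skew-right dataset has its peak on the left and a long tail extending to the right; a skew-left dataset is the reverse.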
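As a rough sketch of how modality could be checked numerically, the code below bins a made-up dataset with `np.histogram` and counts the bins that are taller than both neighbors. The helper `count_peaks` is hypothetical and deliberately naive: it ignores flat plateaus and noise, so on real data it would need smoothing or wider bins.

```py
import numpy as np

def count_peaks(counts):
    """Count bins strictly taller than both neighbors (naive, illustrative)."""
    # Pad with zeros so the first and last bins can also qualify as peaks
    padded = np.concatenate(([0], counts, [0]))
    middle = padded[1:-1]
    is_peak = (middle > padded[:-2]) & (middle > padded[2:])
    return int(np.sum(is_peak))

# Hypothetical data with one cluster near 2 and another near 8
data = np.array([1, 1, 2, 2, 2, 3, 7, 8, 8, 8, 9, 9])

# Bin edges chosen by hand for this example
counts, edges = np.histogram(data, bins=[0, 2, 4, 6, 8, 10])

print(counts)               # [2 4 0 1 5]
print(count_peaks(counts))  # 2, which suggests a bimodal dataset
```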

### Normal Distribution Using the NumPy Module

A normally distributed dataset can be generated in NumPy with the following function:

`np.random.normal(loc, scale, size)`

Here `loc` is the mean of the normal distribution, `scale` is its standard deviation, and `size` is the number of samples to draw.
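A short sketch of a call to `np.random.normal` follows. The mean of 170, standard deviation of 10, sample count, and seed are all arbitrary values chosen for illustration; the seed only makes the run repeatable.

```py
import numpy as np

# Fix the seed so the example is repeatable (arbitrary choice)
np.random.seed(0)

# Hypothetical parameters: mean 170, standard deviation 10, 10,000 samples
samples = np.random.normal(loc=170, scale=10, size=10000)

# The sample statistics should land close to the requested parameters
print(len(samples))
print(np.mean(samples))
print(np.std(samples))
```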

### Standard Deviation

The standard deviation of a normal distribution determines how spread out the data is around the mean.

• About 68% of samples fall within 1 standard deviation of the mean.

• About 95% of samples fall within 2 standard deviations of the mean.

• About 99.7% of samples fall within 3 standard deviations of the mean.
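These proportions can be checked empirically. The sketch below draws a large random sample (the size and seed are arbitrary) and measures the fraction of values within 1, 2, and 3 standard deviations of the mean; the results should come out close to, but not exactly at, the figures above.

```py
import numpy as np

# Arbitrary seed and sample size, chosen only to make the run repeatable
np.random.seed(42)
samples = np.random.normal(loc=0, scale=1, size=100000)

mean = np.mean(samples)
std = np.std(samples)

# Distance of every sample from the mean, in original units
distance = np.abs(samples - mean)

# Mean of a Boolean array = fraction of samples meeting the condition
within_1 = np.mean(distance <= 1 * std)
within_2 = np.mean(distance <= 2 * std)
within_3 = np.mean(distance <= 3 * std)

print(within_1, within_2, within_3)
```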

### Histogram Visualization

A histogram is a plot that visualizes the distribution of samples in a dataset. The horizontal axis is divided into bins, each covering a range of values between a minimum and a maximum. The vertical axis shows the frequency: the number of samples that fall into each bin, which can be any count from zero upward.
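The bins and frequencies behind a histogram can be computed without plotting. This sketch uses `np.histogram` on made-up values between 0 and 50, split into five equal-width bins:

```py
import numpy as np

# Hypothetical dataset with values between 0 and 50
data = np.array([3, 7, 12, 18, 21, 25, 25, 33, 41, 50])

# Five equal-width bins spanning 0 to 50
counts, bin_edges = np.histogram(data, bins=5, range=(0, 50))

# Each bin runs from one edge to the next; the last bin includes its maximum
for low, high, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{low:.0f} to {high:.0f}: {count}")
```

The `counts` array is what a plotting library would draw as bar heights, and `bin_edges` always has one more entry than `counts`.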

## Introduction to Statistics with NumPy

Lesson 1 of 2

1. You’re a citizen scientist who has started collecting data about rising water in the river next to where you live. For months, you painstakingly measure the water levels and enter your findings int…
2. The first statistical concept we’ll explore is mean, also commonly referred to as an average. The mean is a useful measurement to get the center of a dataset. NumPy has a built-in function to cal…
3. We can also use `np.mean` to calculate the percent of array elements that have a certain property. As we know, a logical operator will evaluate each item in an array to see if it matches the specif…
4. If we have a two-dimensional array, `np.mean` can calculate the means of the larger array as well as the interior values. Let’s imagine a game of ring toss at a carnival. In this game, you have thr…
5. As we can see, the mean is a helpful way to quickly understand different parts of our data. However, the mean is highly influenced by the specific values in our data set. What happens when one of t…
6. One way to quickly identify outliers is by sorting our data. Once our data is sorted, we can quickly glance at the beginning or end of an array to see if some values lie far beyond the expected ran…
7. Another key metric that we can use in data analysis is the median. The median is the middle value of a dataset that’s been ordered in terms of magnitude (from lowest to highest). Let’s look at …
8. In a dataset, the median value can provide an important comparison to the mean. Unlike a mean, the median is not affected by outliers. This becomes important in skewed datasets, datasets whose va…
9. As we know, the median is the middle of a dataset: it is the number for which 50% of the samples are below, and 50% of the samples are above. But what if we wanted to find a point at which 40% of t…
10. Some percentiles have specific names: - The 25th percentile is called the first quartile - The 50th percentile is called the median - The 75th percentile is called the *third quarti…
11. While the mean and median can tell us about the center of our data, they do not reflect the range of the data. That’s where standard deviation comes in. Similar to the interquartile range, the …
12. As we saw in the last exercise, knowing the standard deviation of a dataset can help us understand how spread out our dataset is. We can find the standard deviation of a dataset using the NumPy f…
13. Let’s review! In this lesson, you learned how to use NumPy to analyze single-variable datasets. Here’s what we covered: - Using the `np.sort` method to locate outliers. - Calculating central positio…

Lesson 2 of 2

1. A university wants to keep track of the popularity of different programs over time, to ensure that programs are allocated enough space and resources. You work in the admissions office and are asked…
2. When we first look at a dataset, we want to be able to quickly understand certain things about it: - Do some values occur more often than others? - What is the range of the dataset (i.e., the min …
3. Suppose we had a larger dataset with values ranging from 0 to 50. We might not want to know exactly how many 0’s, 1’s, 2’s, etc. we have. Instead, we might want to know how many values fall betwee…
4. We can graph histograms using a Python module known as Matplotlib. We’re not going to go into detail about Matplotlib’s plotting functions, but if you’re interested in learning more, take our co…
5. Histograms and their datasets can be classified based on the shape of the graphed values. In the next two exercises, we’ll look at two different ways of describing histograms. One way to classify…
6. Most of the datasets that we’ll be dealing with will be unimodal (one peak). We can further classify unimodal distributions by describing where most of the numbers are relative to the peak. A *sym…
7. The most common distribution in statistics is known as the normal distribution, which is a symmetric, unimodal distribution. Lots of things follow a normal distribution: - The heights of a large…
8. We can generate our own normally distributed datasets using NumPy. Using these datasets can help us better understand the properties and behavior of different distributions. We can also use them to…
9. In a normal distribution, we know that the mean and the standard deviation determine certain characteristics of the shape of our data, but how exactly? Let’s do some exploration to find out!
10. We know that the standard deviation affects the “shape” of our normal distribution. The last exercise helps to give us a more quantitative understanding of this. Suppose that we have a normal dis…
11. It’s known that a certain basketball player makes 30% of his free throws. On Friday night’s game, he had the chance to shoot 10 free throws. How many free throws might you expect him to make? We …
12. There are some complicated formulas for determining these types of probabilities. Luckily for us, we can use NumPy - specifically, its ability to generate random numbers. We can use these random nu…
13. Let’s return to our original question: Our basketball player has a 30% chance of making any individual basket. He took 10 shots and made 4 of them, even though we only expected him to make 3. Wh…
14. Let’s review! In this lesson, you learned how to use NumPy to analyze different distributions and generate random numbers to produce datasets. Here’s what we covered: - What is a histogram and how…
