# Worksheet 1: Introduction to Data Science

Welcome to DSCI 100: Introduction to Data Science!  

Each week you will complete a lecture assignment like this one. Before we get started, there are some administrative details.

You can't learn technical subjects without hands-on practice. The weekly lecture worksheets and tutorials are an important part of the course. The lecture worksheets and tutorials will automatically be collected on the due date. Attendance in lectures and tutorials is required. There will be participatory activities in both the lecture and tutorial to help support your learning.

- The lecture worksheets are worth 5% of your final grade. 
    - Each question is worth 1 point. 
- The tutorial assignments are worth 15% of your final grade. 
    - Each autograded question is worth 1 point. 
    - Each manually graded question is worth 3 points. 

Collaborating on lecture worksheets and tutorial assignments is more than okay -- it's encouraged! You should rarely be stuck for more than a few minutes on questions in lecture or tutorial, so ask a neighbor, TA or an instructor for help (explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it). Please don't just share answers, though. Everyone must submit a copy of their own work.

### Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

* use a Jupyter notebook to execute provided Python code
* edit code and markdown cells in a Jupyter notebook
* create new code and markdown cells in a Jupyter notebook
* load the `pandas` package into Python 
* create new variables and objects in Python
* use the help and documentation tools in Python
* perform the following operations from the `pandas` package:
    - read a standard .csv file using `read_csv`.
    - subset rows and columns of a data frame using `loc[]`.
    - select columns of a data frame using `[]`.
    - create a new column of a data frame using `assign`.
* use the `altair` package to create plots

In this first worksheet you will also learn how to test the answers you write in this worksheet to assess if you answered questions correctly before your assignment is collected.

This worksheet covers parts of [the Introduction chapter](https://python.datasciencebook.ca/intro.html) of the online textbook. In most worksheets we expect you to read the textbook chapters before completing the worksheet. We know that might not have been possible for this one, so we have added a bit more help to get you through. You should still read the chapter to get a deeper understanding of this week's material (it will help you answer the problems in this week's tutorial homework more easily).

## 1. Jupyter Notebooks
This webpage is called a Jupyter notebook. A notebook is a place to write computer code for analysis, view the results of the analysis, as well as to narrate the analysis with rich formatted text.

### 1.1. Text Cells
In a notebook, each rectangle containing text or code is called a *cell*.

Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.  You don't need to learn Markdown, but you might want to.

After you edit a text cell, click the "run cell" button at the top that looks like ▶ to confirm any changes. (Try not to delete the instructions of the lab.)

**Question 1.1.1**
<br> {points: 0}

This paragraph is in its own text cell.  Try editing it so that all of the sentences following this one are deleted, then click the "run cell" ▶ button.  This sentence, for example, should be deleted.  So should this one.

### 1.2. Code Cells
Other cells contain code in the Python language. Running a code cell will execute all of the code it contains.

To run the code in a cell, first click on that cell to activate it.  It will be highlighted with a blue rectangle to the left of it when activated.  Next, either press Run ▶ or hold down the `shift` key and press `return` or `enter`.

Try running the next cell:

In [None]:
print("Hello, World!")

The above code cell contains a single line of code, but cells can also contain multiple lines of code. When you run a cell, the lines of code are executed in the order in which they appear. Every `print` expression prints a line. Run the next cell and notice the order of the output.

In [None]:
print("First this line is printed,")
print("and then this one.")

**Question 1.2.1**
<br> {points: 0}

Change the cell above so that it prints out:

    First this line is printed,
    and then the next line, 
    and then this one.

*Hint:* If you're stuck for more than a few minutes, try talking to a neighbor or a TA.  That's a good idea for any worksheet or tutorial problem.

### 1.3. Writing Jupyter Notebooks
You can use Jupyter notebooks for your own projects or documents.  When you make your own notebook, you'll need to create your own cells for text and code.

To add a cell, click the + button in the menu bar of this tab.  The newly created cell will start out as a code cell.  You can change it to a text cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart and runall button (⏩) in the menu bar of this tab, and changing it from "Code" to "Markdown".

**Question 1.3.1**
<br> {points: 0}

Add a code cell below this one.  Write code in it that prints out:
   
    A whole new code cell!

Run your cell to verify that it works.

**Question 1.3.2**
<br> {points: 0}

Add a text/Markdown cell below this one. Write the text "A whole new Markdown cell" in it.

### 1.4. Comments
Below you see lines like this in code cells:

    # Test cell; please do not change!

That is called a *comment*.  It doesn't make anything happen in Python; Python ignores anything on a line after a #.  Instead, it's there to communicate something about the code to you, the human reader.  Comments are extremely useful and can help increase how readable our code is.

<img src="http://imgs.xkcd.com/comics/future_self.png">

*Source: https://xkcd.com/1421/*

The below code cell contains comments (one at the start of a line, and one after some other code). Run the cell. You will see that everything after a comment symbol `#` is ignored by Python.

In [None]:
# you can use comments to document your code, or make Python ignore some code without deleting it entirely
# print("this is a commented line that Python will ignore. You won't see this text in the output!")

print("hello!") # you can also put comments at the end of a line of code

### 1.5. Errors
Python is a language, and like natural human languages, it has rules.  It differs from natural language in two important ways:
1. The rules are *simple*.  You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
2. The rules are *rigid*.  If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes.  A computer running Python code is not smart enough to do that.

Whenever you write code, you'll make mistakes (everyone who writes code does, even your course instructor!).  When you run a code cell that has errors, Python will sometimes produce error messages to tell you what you did wrong.

Errors are okay; even experienced programmers make many errors.  When you make an error, you just have to find the source of the problem, fix it, and move on.

We have made an error in the next cell.  Uncomment the line below, run the cell and see what happens.

In [None]:
#print("This line is missing something."

![ws1_error_image_python.png](images/ws1_error_image_python.png)

There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it.  (Of course, if you're frustrated, ask a neighbor or a TA for help.)

Try to fix the code above so that you can run the cell and see the intended message instead of an error.

### 1.6. The Kernel
The kernel is a program that executes the code inside your notebook and outputs the results. In the top right of your window, you can see a circle that indicates the status of your kernel. If the circle is empty (⚪), the kernel is idle and ready to execute code. If the circle is filled in (⚫), the kernel is busy running some code. 

You may run into problems where your kernel is stuck for an excessive amount of time, your notebook is very slow and unresponsive, or your kernel loses its connection. If this happens, try the following steps:
1. At the top of your screen, click **Kernel**, then **Interrupt Kernel**.
2. If that doesn't help, click **Kernel**, then **Restart Kernel...**. If you do this, you will have to run your code cells from the start of your notebook up until where you paused your work.
3. If that doesn't help, restart your server. First, save your work by clicking **File** at the top left of your screen, then **Save Notebook**. Next, from the **File** menu click **Hub Control Panel**. Choose **Stop My Server** to shut it down, then **My Server** to start it back up. Then, navigate back to the notebook you were working on.

### 1.7. Saving your work

It's important to save your work often so you don't lose your progress! At the top of the screen, go to the **File** menu, then **Save Notebook**. There is also a disk icon in the menu of this tab that can be used to save your work. Finally, there are keyboard shortcuts for saving your work too: control + s on Windows, or command + s on Mac. Once you've saved your work, you will see a message at the bottom of the screen that says **Saving completed**.

### 1.8. Submitting your work
All lecture worksheets and tutorial assignments in the course will be distributed as notebooks like this one. You will complete your work in this notebook, and at the due date we will copy this notebook and grade that copy. For lecture worksheets we will use a system called nbgrader that checks your work. For tutorial assignments we will use a combination of nbgrader and manual grading of your work.

**Play the YouTube videos below to see how to properly answer questions and save your work in DSCI 100 worksheets or tutorials.**

Below are 2 videos that explain:

1. [How to save your Jupyter notebook](https://www.youtube.com/watch?v=0aoLgBoAUSA) and,
2. [How to answer a question in a Jupyter notebook assignment](https://www.youtube.com/watch?v=7j0WKhI3W4s).

If they are not showing up below, uncomment the code in the cell below and run it. You can also click the links above to watch them directly on YouTube.

In [None]:
#from IPython.display import YouTubeVideo

#YouTubeVideo("0aoLgBoAUSA", width=854, height=480)

In [None]:
#from IPython.display import YouTubeVideo

# YouTubeVideo("7j0WKhI3W4s", width=854, height=480)

## 2. Numbers
Quantitative information arises everywhere in data science. In addition to representing commands to print out lines, our Python code can represent numbers and methods of combining numbers. The expression `3.2500` evaluates to the number 3.25. (Run the cell and see.)

In [None]:
3.2500

Notice that we didn't have to write `print()`. When you run a notebook cell, Jupyter will helpfully print the last output for you. So in the below cell, the last statement just evaluates to `4`, so it prints `4` (remember -- each line in a cell is a separate line of code!)

In [None]:
2
3
4

If you want to print out results from earlier lines in the cell, you need to use the `print` function.

In [None]:
print(2)
print(3)
print(4)

### 2.1. Arithmetic
The line in the next cell subtracts.  Its value is what you'd expect.  Run it.

In [None]:
2.0 - 1.5

Same with the cell below. Run it.

In [None]:
2 * 2

Many basic arithmetic operations are built into Python.  [This webpage](https://docs.python.org/3.9/library/operator.html) describes all the arithmetic operators used in the course.  You can refer back to this webpage as needed throughout the term.
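As a quick illustration of some of these operators, here is a minimal sketch. The numbers and object names are made up for this example and are not used elsewhere in the worksheet.

```python
# A few of Python's built-in arithmetic operators
total = 7 + 2            # addition
difference = 7 - 2       # subtraction
product = 7 * 2          # multiplication
quotient = 7 / 2         # division
floor_quotient = 7 // 2  # division, rounded down to a whole number
remainder = 7 % 2        # remainder left over after division
power = 7 ** 2           # exponentiation (7 squared)

print(total, difference, product, quotient, floor_quotient, remainder, power)
```

Note that `/` always produces a decimal number (a float), even when both inputs are whole numbers.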

## 3. Names
In natural language, we have terminology that lets us quickly reference very complicated concepts.  We don't say, "That's a large mammal with brown fur and sharp teeth!"  Instead, we just say, "Bear!"

Similarly, an effective strategy for writing code is to define names for data as we compute it, like a lawyer would define terms for complex ideas at the start of a legal document to simplify the rest of the writing.

In Python, we do this with *objects*. An object has a name on the left side of an `=` sign and an expression to be evaluated on the right.

In [None]:
answer = 3 * 2 + 4

When you run that cell, Python first evaluates the first line.  It computes the value of the expression `3 * 2 + 4`, which is the number 10.  Then it gives that value the name `answer`.  At that point, the code in the cell is done running.

After you run that cell, the value 10 is bound to the name `answer`:

In [None]:
answer

We can name our objects anything we'd like. Above we called it `answer`, but we could have named it `value`, `data` or anything else we desired. A good rule of thumb is to name it something that has meaning to a human as it relates to what we are trying to accomplish with our Python code.

**Question 3.1**
<br> {points: 0}

Enter a new code cell. Try creating another object using `= 3 * 2 + 4` with a name different from `answer`.

A common pattern in Jupyter notebooks is to assign a value to a name and then immediately evaluate the name in the last line in the cell so that the value is displayed as output.

In [None]:
close_to_pi = 355 / 113
close_to_pi

Another common pattern is that a series of lines in a single cell will build up a complex computation in stages, naming the intermediate results.

In [None]:
bimonthly_salary = 840
monthly_salary = 2 * bimonthly_salary
number_of_months_in_a_year = 12
yearly_salary = number_of_months_in_a_year * monthly_salary
yearly_salary

When naming objects in Python there are some rules:
1. Names in Python can have letters (upper- and lower-case letters are both okay and count as different letters e.g. "Answer" and "answer" will be treated as different objects), underscores, and numbers. 
2. The first character can't be a number (otherwise a name might look like a number).  
3. Names can't contain spaces, since spaces are used to separate pieces of code from each other. Instead, it is common to use an underscore character _ to replace each space.
4. Names can't contain other special characters such as -, +, #, $, %, ^ since some characters have special roles in Python. Take # for example: this character marks the start of a comment in code.

Other than those rules, what you name something doesn't matter *to Python*.  For example, the next cell does the same thing as the above cell, except everything has a different name:

In [None]:
a = 840
b = 2 * a
c = 12
d = c * b
d

**However**, names are very important for making your code *readable* to yourself and others.  The cell above is shorter, but it's totally useless without an explanation of what it does. 

There is also cultural style associated with different programming languages. In the modern Python style, object names should use only lowercase letters, numbers, and `_`. Underscores (`_`) are typically used to separate words within a name (*e.g.*, `answer_one`).
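If you are unsure whether a name follows the rules above, one way to check is the string method `isidentifier`, which reports whether a piece of text could be used as a Python name. This is only a small sketch; the candidate names are made up for illustration.

```python
# Each line asks Python whether the text in quotes is a valid name.
follows_rules = "answer_one".isidentifier()      # letters and an underscore: allowed
starts_with_digit = "2nd_answer".isidentifier()  # first character is a number: not allowed
has_space = "my answer".isidentifier()           # contains a space: not allowed
has_symbol = "cost_in_$".isidentifier()          # contains a special character: not allowed

print(follows_rules, starts_with_digit, has_space, has_symbol)
```

Only the first check gives `True`. Keep in mind that a few reserved words, such as `if` and `for`, pass this check but still cannot be used as names.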

**Question 3.2** <br> {points: 1}

Assign the name `seconds_in_an_hour` to the number of seconds in an hour. You should do this in two steps. In the first, you calculate the number of seconds in a minute and assign that number the name `seconds_in_a_minute`. Next, you should calculate the number of seconds in an hour and assign that number the name `seconds_in_an_hour`.

*Hint - there are 60 seconds in a minute and 60 minutes in an hour*

In [None]:
# your code here
raise NotImplementedError

# We've put this line in this cell so that it will print
# the value you've given to seconds_in_an_hour when you
# run it.  You don't need to change this.
seconds_in_an_hour

### 3.2. Checking your code


Now that you know how to name things, you can start using the built-in *tests* to check whether your work is correct.

Below is an example of a test cell for Question 3.2 above (assesses whether you have assigned `seconds_in_an_hour` correctly). If you haven't, this test will tell you that your solution is incorrect. Try not to change the contents of the test cells. Resist the urge to just copy it, and instead try to adjust your expression. (Sometimes the tests will give hints about what went wrong...)

In [None]:
from hashlib import sha1
assert sha1(str(type(seconds_in_a_minute)).encode("utf-8")+b"f63740cc00b619a8").hexdigest() == "3477a35744ae914d16ea34e3ef0503296d1c60fa", "type of seconds_in_a_minute is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(seconds_in_a_minute).encode("utf-8")+b"f63740cc00b619a8").hexdigest() == "00b3dd93aa1831461b8ead8a8420a12b56f5a557", "value of seconds_in_a_minute is not correct"

assert sha1(str(type(seconds_in_an_hour)).encode("utf-8")+b"accddfbdb25cbb86").hexdigest() == "9318290a0f9316c20adb638b9ca35ef30802ba13", "type of seconds_in_an_hour is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(seconds_in_an_hour).encode("utf-8")+b"accddfbdb25cbb86").hexdigest() == "58b911d62083514b7978a0b375da4244c33afa17", "value of seconds_in_an_hour is not correct"

print('Success!')

For this first question we'll provide you the solution:

In [None]:
# Calculate the number of seconds in an hour.

# SOLUTION:
seconds_in_a_minute = 60
seconds_in_an_hour = seconds_in_a_minute * 60

# We've put this line in this cell so that it will print
# the value you've given to seconds_in_an_hour when you
# run it.  You don't need to change this.
seconds_in_an_hour

*Note: All autograded questions with visible tests in this course are worth 1 point.*

## 4. Calling Functions/Methods and Attributes

The most common way to combine or manipulate values in Python is by calling functions or methods. Python comes with many built-in functions and methods that perform common operations. You can think of functions and methods as verbs that do things, and of objects in Python as nouns, which are entities that exist.

In the module, we explored examples of functions and methods such as `print`. Here, we'll demonstrate another one, the method `upper`, which converts text to uppercase:

In [None]:
greeting = "Why, hello there!".upper()
greeting

> The `upper` method we used here is different from the functions we used previously (e.g. `print`). It is called using dot notation (`string.upper()`) because a method only works for the particular kind of object it was designed for. Here, the `upper` method was written to work only with string objects. The `print` function, however, was written to work with many kinds of objects, so we don't use dot notation with it.
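To make the difference concrete, here is a small sketch that applies one function (`len`) and one method (`title`) to the same string. The movie title and object names are only illustrative.

```python
movie = "spirited away"

# Function call: the object goes inside the parentheses.
number_of_characters = len(movie)

# Method call: the object comes first, followed by a dot and the method name.
capitalized = movie.title()

print(number_of_characters)
print(capitalized)
```

`len` also works on other kinds of objects, while `title` is only available on strings.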

**Question 4.0** <br> {points: 1} 

Use the method `lower` to change all the words in the following movie title to lower case text: "The House with a Clock in Its Walls" and assign the lower case text the name `title`.

In [None]:
# your code here
raise NotImplementedError
title

In [None]:
from hashlib import sha1
assert sha1(str(type(title)).encode("utf-8")+b"4a02a9fac26f187c").hexdigest() == "74e6c7cd604c219a09657981da8cac1adb50e305", "type of title is not str. title should be an str"
assert sha1(str(len(title)).encode("utf-8")+b"4a02a9fac26f187c").hexdigest() == "cc89838a600cb96ac4f19d08a922e8a008ddec11", "length of title is not correct"
assert sha1(str(title.lower()).encode("utf-8")+b"4a02a9fac26f187c").hexdigest() == "0ca4232ae884265060d5127e1d428043a1936534", "value of title is not correct"
assert sha1(str(title).encode("utf-8")+b"4a02a9fac26f187c").hexdigest() == "0ca4232ae884265060d5127e1d428043a1936534", "correct string value of title but incorrect case of letters"

print('Success!')

### 4.1. Multiple Arguments
Some functions take multiple arguments, separated by commas. For example, the built-in `max` function returns the maximum argument passed to it.

In [None]:
biggest = max(2, 15, 4, 7)
biggest

**Question 4.1** <br> {points: 1}

Use the `min` function to find the minimum value of the numbers in the cell above.

Assign the value to an object called `smallest`.

In [None]:
# your code here
raise NotImplementedError
smallest

In [None]:
from hashlib import sha1
assert sha1(str(type(smallest)).encode("utf-8")+b"4eea41f2186b858d").hexdigest() == "510a0d1322c553ff0a9812872f837985fab18af9", "type of smallest is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(smallest).encode("utf-8")+b"4eea41f2186b858d").hexdigest() == "16fd87481587d2508b6cc8d19a93d6cebd8854b4", "value of smallest is not correct"

print('Success!')

## 5. Packages
Python has many built-in functions, but we can also use functions that are stored within packages created by other Python users. We are going to use a package, called `pandas`, to load, modify and plot data.
This package has already been installed for you. Later in the course you will learn how to install packages so you are free to bring in other tools as you need them for your data analysis.

To use the functions from a package you first need to load it using an `import` statement. This needs to be done once per notebook (and a good rule of thumb is to do this at the very top of your notebook so it is easy to see what packages your Python code depends on).

Here we also give `pandas` a nickname of `pd`, formally called an alias. This lets us refer to the pandas package more efficiently by just typing `pd` instead of `pandas`. Referring to packages with aliases is very common in Python and you will see us do this with many of the packages we use in this course.

In [None]:
import pandas as pd
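The same aliasing pattern works for other packages too. As a sketch, the code below loads `math`, which is part of Python's standard library, under the made-up alias `m` (this alias is chosen just for illustration and is not a common convention), and then calls its `sqrt` function through the alias.

```python
import math as m

# After the import, functions from the package are reached as alias.function_name
root = m.sqrt(16)
print(root)
```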

**Question 5.1** <br> {points: 1} 

Use an `import` statement to load the `numpy` Python package as `np`.

In [None]:
import sys

# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(('numpy' in sys.modules))).encode("utf-8")+b"bf63784a0c69bee8").hexdigest() == "a2e7805fff17c59d5a3468875322a2367d784d5c", "type of ('numpy' in sys.modules) is not bool. ('numpy' in sys.modules) should be a bool"
assert sha1(str(('numpy' in sys.modules)).encode("utf-8")+b"bf63784a0c69bee8").hexdigest() == "6bc7e0c230b8393a77b624b1c6e7539605237284", "boolean value of ('numpy' in sys.modules) is not correct"

assert sha1(str(type((np.__name__ == 'numpy'))).encode("utf-8")+b"cccdab030c8b29ca").hexdigest() == "202f26d6310e1f7ff6be7f4a604e3f2b552aa238", "type of (np.__name__ == 'numpy') is not bool. (np.__name__ == 'numpy') should be a bool"
assert sha1(str((np.__name__ == 'numpy')).encode("utf-8")+b"cccdab030c8b29ca").hexdigest() == "0f6ea0dcc7eac6bb48393e207c0c9f6eb1bf9b2f", "boolean value of (np.__name__ == 'numpy') is not correct"

print('Success!')

## 6. Looking for Help

No one, not even experienced, professional programmers, remembers what every function does, nor do they remember every possible function argument/option. So both experienced and new programmers (like you!) need to look things up, A LOT!

### 6.1. Help Files
One of the most efficient places to look for help on how a function works is the Python documentation. Let’s say we wanted to pull up the documentation for the `read_csv` function in pandas. We can do this by typing the `?` character followed by the name we want more information about. Another way to view the documentation is to place the cursor on the name and then press `shift` + `tab`, or to click on the `Help` text in the menu bar at the top and then select `Show Contextual Help`, as described in detail in the textbook.

Run the cell below to find out more about the `read_csv` function from the `pandas` package.

In [None]:
?pd.read_csv

At the very top of the output, you will see the function itself and its arguments. Next is a description of what the function does. The bottom of the file specifies the package it is in (in this case, it is pandas). You’ll find that the most helpful sections on this page are “Parameters” and "Examples". 

- **Docstring** at the top gives you an idea of how you would use the function when coding--what the syntax would be and how the function itself is structured.
- **Parameters** tells you the different parts that can be added to the function to make it simpler or more complicated. Often the “Parameters” section doesn’t provide you with step-by-step instructions, because there are so many different ways that a person can incorporate a function into their code. Instead, it gives users a general understanding of what the function can do and which parts can be added. At the end of the day, the user must interpret the help file and figure out how best to use the function and which parts are most important to include for their particular task.
- The **Returns** explains what to expect as an output.
- The **Examples** section is often the most useful part of the help file as it shows how a function could be used with real data. It provides a skeleton code that the users can work off of.
- Sometimes there is a **See Also** section which may suggest similar functions that could be of use to the user.

Beyond the Python help files there are many resources that you can use to find help. [Stack Overflow](https://stackoverflow.com/), an online forum, is a great place to go and ask questions such as how to perform a complicated task in Python or why a specific error message is popping up. Oftentimes, a previous user will have already asked your question of interest and received helpful advice from fellow Python users.
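The `?` shortcut is provided by the Jupyter Python kernel (IPython). Python itself also has a built-in `help` function that displays the same kind of documentation, and it works anywhere Python runs. A minimal sketch, using the built-in `max` function as the example:

```python
# Print the documentation for the built-in max function.
help(max)

# The documentation text is also stored on the function itself as a string.
documentation = max.__doc__
print(type(documentation))
```

The exact wording of the help text can differ between Python versions, so don't worry if yours looks slightly different from a neighbor's.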

**Question 6.1** Multiple Choice:
<br> {points: 1}

Use `?pd.read_csv` to answer the multiple choice question below. To answer the question, assign the letter associated with the correct answer to a variable in the code cell below:

Which statement below is accurate?

A. `pd.read_csv` is useful for reading comma-separated values (csv) file into DataFrame.

B. It can accept a possible parameter of `warnings=True`.

C. The parameter delimiter is an alias for the parameter squeeze.

D. `pd.read_csv` is perfect for reading a table of fixed-width formatted lines into DataFrame.

*Assign your answer to an object called `answer6_1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer6_1)).encode("utf-8")+b"883c4754b8b7a3f9").hexdigest() == "6ef96ab387b289051f15fe12f1999a6b5b0e33c1", "type of answer6_1 is not str. answer6_1 should be an str"
assert sha1(str(len(answer6_1)).encode("utf-8")+b"883c4754b8b7a3f9").hexdigest() == "26d3690b8eb0adbf5db9718db7cd3dc913366a42", "length of answer6_1 is not correct"
assert sha1(str(answer6_1.lower()).encode("utf-8")+b"883c4754b8b7a3f9").hexdigest() == "989c06535122a1e83624b49f0a2867f063bfe178", "value of answer6_1 is not correct"
assert sha1(str(answer6_1).encode("utf-8")+b"883c4754b8b7a3f9").hexdigest() == "e3f0f0d669a2123352473067a8698a5f147ac10d", "correct string value of answer6_1 but incorrect case of letters"

print('Success!')

## 7. Pandas Functions 

Now that we have learned a little about Jupyter notebooks and Python, let's load a real dataset into Python and explore it. As we do this we will learn more about key data loading, wrangling and visualization functions in Python. 

### Exercise: Data about Runners!
Researchers Vickers and Vertosick performed [a study in 2016](https://bmcsportsscimedrehabil.biomedcentral.com/articles/10.1186/s13102-016-0052-y) that aimed to identify what factors had a relationship with race performance of recreational runners so that they could better predict future 5 km, 10 km and marathon race times for individual runners. Such predictions (and knowing what drives these predictions) can help runners by suggesting changes they could make to modifiable factors, such as training, to help them improve race time. Unmodifiable factors that contribute to the prediction, such as age or sex, allow for fair comparisons to be made between different runners.

Vickers and Vertosick reasoned that their study is important because all previous research done to predict race times has focused on data from elite athletes. This biased data set means that the predictions generated from it do not necessarily do a good job predicting race times for recreational runners (whose data was not in the data set that was used to create the model that generates the predictions). Additionally, previous research focused on reporting/measuring factors that require special expertise or equipment that are not freely available to recreational runners. This means that recreational runners may not be able to put their characteristics/measurements for these factors in the race time prediction models and so they will not be able to obtain an accurate prediction, or a prediction at all (in the case of some models).

To make a better model, Vickers and Vertosick performed a large survey. They put their survey on the news website [Slate.com](https://slate.com/) attached to a news story about race time prediction. They were able to obtain 2,497 responses. The survey included questions that allowed them to collect a data set that included: 
- age,
- sex,
- body mass index (BMI),
- whether they are an endurance runner or speed demon,
- what type of shoes they wear,
- what type of training they do,
- race time for 2-3 races they completed in the last 6 months,
- self-rated fitness for each race,
- and race difficulty for each race.


Let's now use this data to explore a question we might be interested in: is there a relationship between 10 km race time and body mass index (BMI) for male runners in this data set? This is an exploratory data analysis question because we are looking for a relationship between measurements within the single data set we have, and are not yet interested in interpreting beyond it. We can answer this question by visualizing the data as a scatter plot using Python.

If, however, we were aiming to extend our findings to a broader population, make predictions, or analyze cause or mechanism, we would need to state a different data analysis question and follow up with different analytical methods to answer that question.

To answer our exploratory question (is there a relationship between 10 km race time and body mass index (BMI) for male runners in this data set?), we will need to do the following things in Python:

1. load the data set into Python
2. subset the data we are interested in visualizing from the loaded dataset
3. create a new column to get the unit of time in minutes instead of seconds
4. create a scatter plot using this modified data

> *Note 1 - subsetting the data and converting from seconds to minutes is not absolutely required to answer our question, but it will give us practice manipulating data in Python, and make our data tables and figures more readable.*
>
> *Note 2 - many historical datasets treated sex as a variable where the possible values are only binary: male or female. This representation in this question reflects how the data were historically collected and is not meant to imply that we believe that sex is binary.*
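The first three steps above can be sketched on a tiny made-up table. Everything in this example is hypothetical: the three rows of data, the column names (`sex`, `bmi`, `time_secs`, `time_mins`) and the object names were invented for illustration and are not the names used in the real data set. The data is read from text typed into the cell rather than from a file, so the sketch runs on its own.

```python
import io

import pandas as pd

# Hypothetical miniature data set (made-up values and column names)
csv_text = """sex,bmi,time_secs
male,22.5,3000
female,21.0,3300
male,25.1,3600
"""

# 1. load the data (read_csv accepts a file path or a file-like object)
runners = pd.read_csv(io.StringIO(csv_text))

# 2. keep only the rows for male runners and only the columns we need
male_runners = runners.loc[runners["sex"] == "male", ["bmi", "time_secs"]]

# 3. create a new column with the race time in minutes instead of seconds
male_runners = male_runners.assign(time_mins=male_runners["time_secs"] / 60)

print(male_runners)
```

Step 4 would then hand a data frame like `male_runners` to `altair` to draw the scatter plot.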

**Question 7.0.1** Multiple Choice:
<br> {points: 1}

Which of the following will you *not* find included in Vickers and Vertosick's data set?

A. age

B. what each runner ate before the race 

C. body mass index

D. self-rated fitness for each race



*Assign your answer to an object called `answer7_0_1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer7_0_1)).encode("utf-8")+b"e95a1a9739dd8f02").hexdigest() == "f0fc578f4448ba4b0222b897ea62ead7b00c6aec", "type of answer7_0_1 is not str. answer7_0_1 should be an str"
assert sha1(str(len(answer7_0_1)).encode("utf-8")+b"e95a1a9739dd8f02").hexdigest() == "1a671eea6191956499dde85d140af122e1b98fbd", "length of answer7_0_1 is not correct"
assert sha1(str(answer7_0_1.lower()).encode("utf-8")+b"e95a1a9739dd8f02").hexdigest() == "79ed31bb1136ab2d71571ad4f818f241c821c7b1", "value of answer7_0_1 is not correct"
assert sha1(str(answer7_0_1).encode("utf-8")+b"e95a1a9739dd8f02").hexdigest() == "ff4e126d329172756e2d8da6cad72647d1e2f1d1", "correct string value of answer7_0_1 but incorrect case of letters"

print('Success!')

**Question 7.0.2** True or False: 
<br> {points: 1} 

The researchers compiled this data so that they could build better models to predict marathon race times. 

*Assign your answer to an object called `answer7_0_2`. Make sure your answer is either `True` or `False`.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer7_0_2)).encode("utf-8")+b"2836086c95a7907f").hexdigest() == "45eb09084073819c0a304d4790629bcb81670bf1", "type of answer7_0_2 is not bool. answer7_0_2 should be a bool"
assert sha1(str(answer7_0_2).encode("utf-8")+b"2836086c95a7907f").hexdigest() == "f1c2021a498614a2ff707056aa4717c9b7577b98", "boolean value of answer7_0_2 is not correct"

print('Success!')

**Question 7.0.3** Multiple Choice: 
<br> {points: 1}

What kind of graph will we be creating? Choose the correct answer from the options below. 

A. Bar Graph 

B. Pie Chart

C. Scatter Plot

D. Box Plot 

*Assign your answer to an object called `answer7_0_3`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer7_0_3)).encode("utf-8")+b"69f0b8467af40818").hexdigest() == "e42b5befbba34eca5df749889358d48033300d10", "type of answer7_0_3 is not str. answer7_0_3 should be an str"
assert sha1(str(len(answer7_0_3)).encode("utf-8")+b"69f0b8467af40818").hexdigest() == "5f13f211f39f378913515d3f2c23cfc4b5a4f231", "length of answer7_0_3 is not correct"
assert sha1(str(answer7_0_3.lower()).encode("utf-8")+b"69f0b8467af40818").hexdigest() == "e4ed89359a4b65d72707d4dc013d3926f49f0ae0", "value of answer7_0_3 is not correct"
assert sha1(str(answer7_0_3).encode("utf-8")+b"69f0b8467af40818").hexdigest() == "af40803cebbf272c09f500c64833537c55585e2e", "correct string value of answer7_0_3 but incorrect case of letters"

print('Success!')

### 7.1. Reading Data

Let's get started with our first step - loading the data set. The data set we are loading is called `marathon_small.csv` and it contains a subset of the data from the study described above. The file is in the same directory/folder as the file for this notebook. It is a comma separated file (meaning the columns are separated by the `,` character). We often refer to these files as `.csv`'s.


```
age,bmi,km5_time_seconds,km10_time_seconds,sex
25.0,21.6221160888672,NA,2798,female
41.0,23.905969619751,1210.0,NA,male
25.0,21.6407279968262,994.0,NA,male
35.0,23.5923233032227,1075.0,2135,male
34.0,22.7064037322998,1186.0,NA,male
45.0,42.0875434875488,3240.0,NA,female
33.0,22.5182952880859,1292.0,NA,male
58.0,25.2340793609619,NA,3420,male
29.0,24.505407333374,1440.0,3240,male
```

We can use the `pd.read_csv` function from the `pandas` package to do this. Below is an example of reading a `.csv` file that is in the same directory/folder as the file for the notebook that would be reading it in:

<img src="images/ws1_read_csv_gen_py.png" width="500">

*Note - the quotes around the filename are important and you will get an error if you forget them.*
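
As a rough sketch of the same pattern, the cell below reads a small, made-up table. The text in `example_csv` is hypothetical and is wrapped in `io.StringIO` only so that the example runs without a file on disk; with a real file you would pass the quoted file name instead. Notice that entries written as `NA` are read in as missing values (`NaN`).

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
example_csv = io.StringIO(
    "colour,size,speed\n"
    "red,15,12.3\n"
    "blue,19,NA\n"
    "blue,20,23.2\n"
)

# With a real file you would write, for example,
# pd.read_csv("example_file.csv") with the file name in quotes.
example_data = pd.read_csv(example_csv)
print(example_data)
```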

**Question 7.1.1** <br> {points: 1}

Use the `pd.read_csv` function from the `pandas` package to load the data from the `marathon_small.csv` file into Python. Save the data to an object called `marathon_small`. If you need additional help, try `?pd.read_csv` and/or ask your neighbours or the instructional team for help.

In [None]:
import pandas as pd

# your code here
raise NotImplementedError
marathon_small

In [None]:
from hashlib import sha1
assert sha1(str(type(marathon_small)).encode("utf-8")+b"2b2cd8a8f7191b99").hexdigest() == "f70fd1cb365e7ef849abb68e07fcded0b8f51544", "type of type(marathon_small) is not correct"

assert sha1(str(type(marathon_small.shape)).encode("utf-8")+b"554d659718d61ba6").hexdigest() == "37d9acd1ce275b3115cb09c44b01b4912cb468b3", "type of marathon_small.shape is not tuple. marathon_small.shape should be a tuple"
assert sha1(str(len(marathon_small.shape)).encode("utf-8")+b"554d659718d61ba6").hexdigest() == "ac4aa5352c2f9f41b21ea9aacef2791c1ef4930d", "length of marathon_small.shape is not correct"
assert sha1(str(sorted(map(str, marathon_small.shape))).encode("utf-8")+b"554d659718d61ba6").hexdigest() == "991ba07577ed8a6d05d3e14f114b2fde519b585f", "values of marathon_small.shape are not correct"
assert sha1(str(marathon_small.shape).encode("utf-8")+b"554d659718d61ba6").hexdigest() == "b73f34a6fa67f28c61b09dac0b47d9fb7fe17ef4", "order of elements of marathon_small.shape is not correct"

assert sha1(str(type(sum(marathon_small.age))).encode("utf-8")+b"bd89c56249909613").hexdigest() == "899d284abdf751ed98eb68ba0d5e12fb6ceb3e4f", "type of sum(marathon_small.age) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(marathon_small.age), 2)).encode("utf-8")+b"bd89c56249909613").hexdigest() == "a518ff73378773221c197311edd70ca545c82a36", "value of sum(marathon_small.age) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(marathon_small.columns.values)).encode("utf-8")+b"2ced1db6836e8c39").hexdigest() == "fa85611df34b71d39b67d8094d4004a5a88e4bce", "type of marathon_small.columns.values is not correct"
assert sha1(str(marathon_small.columns.values).encode("utf-8")+b"2ced1db6836e8c39").hexdigest() == "15ff054fbb79b56c3d79c26a21acb8835e46f6b9", "value of marathon_small.columns.values is not correct"

print('Success!')

**Question 7.1.2** Multiple Choice <br> {points: 1}

From the list below, which is a valid way to store a data frame object read in from `pd.read_csv` to an object in Python?

A. `data == pd.read_csv("example_file.csv")`

B. `data = pd.read_csv("example_file.csv")`

C. `data = pd.read_csv"example_file.csv"`

D. `data = pd.read_csv(example_file.csv)`

*Assign your answer to an object called `answer7_1_2`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer7_1_2)).encode("utf-8")+b"cc7f1e4f1b65d80f").hexdigest() == "e77058f13b23c94bbca2e0fa929e1398f7be3ea0", "type of answer7_1_2 is not str. answer7_1_2 should be an str"
assert sha1(str(len(answer7_1_2)).encode("utf-8")+b"cc7f1e4f1b65d80f").hexdigest() == "f38386450d509d040820b52b4378b3477460756f", "length of answer7_1_2 is not correct"
assert sha1(str(answer7_1_2.lower()).encode("utf-8")+b"cc7f1e4f1b65d80f").hexdigest() == "629a384db2e71cc7ca53221944d701682bf45c3b", "value of answer7_1_2 is not correct"
assert sha1(str(answer7_1_2).encode("utf-8")+b"cc7f1e4f1b65d80f").hexdigest() == "fc42f4a0c2d7677ca8a98724586c050e5c0c587a", "correct string value of answer7_1_2 but incorrect case of letters"

print('Success!')

### 7.2. Data frames

The `pd.read_csv` function from the `pandas` package gives us a data frame, and we can look at the structure of a data frame by simply writing its name to view the output.

In [None]:
marathon_small

This returns the first 5 and last 5 rows of the data frame, and hides the middle rows with an ellipsis (`...`).

By default, `pd.read_csv` treats the first row of the file as the **header** and uses it to label the columns. Therefore, the first row contains descriptive names while the rows below contain the actual data. The bolded column on the left without a header is called the index. For now, you can think of it as the row numbers of the data frame.

This only shows us a small portion of the data set. You can look at more of the data set by using the `head` method to specify the number of rows you want to print.

In [None]:
marathon_small.head(50)

This shows us the first 50 rows of the data set. We could look at the entire data set by changing the `n` argument, but printing many rows makes for very long output and is usually unnecessary.
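
To make the `n` argument concrete, here is a minimal sketch on a hypothetical 100-row data frame called `numbers` (it is not part of the marathon data). The number of rows can be passed either by position or by name.

```python
import pandas as pd

# Hypothetical data frame with 100 rows, used only for illustration.
numbers = pd.DataFrame({"value": range(100)})

first_three = numbers.head(3)    # number of rows given by position
also_three = numbers.head(n=3)   # the same call with the argument named

print(first_three)
```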

**Question 7.2.1** <br> {points: 1}

To find out how many rows and columns there are, use the `shape` attribute (note that it is written without parentheses). Assign the number of rows and columns to the object `rows_and_columns`.

In [None]:
# your code here
raise NotImplementedError
print(rows_and_columns)

In [None]:
from hashlib import sha1
assert sha1(str(type(rows_and_columns)).encode("utf-8")+b"2643cd4864760c7b").hexdigest() == "b74ee382d9f6255ecfc54cfbedde786c480b1079", "type of rows_and_columns is not tuple. rows_and_columns should be a tuple"
assert sha1(str(len(rows_and_columns)).encode("utf-8")+b"2643cd4864760c7b").hexdigest() == "c805f45f38681b1a3af1c5161fc6cd0031e9ab81", "length of rows_and_columns is not correct"
assert sha1(str(sorted(map(str, rows_and_columns))).encode("utf-8")+b"2643cd4864760c7b").hexdigest() == "38f75643d15c8c19ea3b7ec5f567254da2148e8d", "values of rows_and_columns are not correct"
assert sha1(str(rows_and_columns).encode("utf-8")+b"2643cd4864760c7b").hexdigest() == "44f6b1a1968095d7eecf8e9646c902213bb581b7", "order of elements of rows_and_columns is not correct"

print('Success!')

### 7.3. Obtaining a subset of rows OR columns with `[]`

One of the most common operations on a data frame is to *filter* its rows (observations) to keep only specific rows based on their entries in one or more columns. To do this we can use the `[]` operation on a `pandas` data frame.

For example, if we had a data frame (named `data`) that looked like this:

```
  colour size speed
1    red   15  12.3
2   blue   19  34.1
3   blue   20  23.2
4    red   22  21.9
5   blue   12  33.6
6   blue   23  28.8
```

We could use the first line of the code in the image below to filter for rows where the colour has the value of "blue". The second line of code below would let us filter for rows where the size has a value greater than 20.

<img src="images/ws1_filter_gen_py.png" width="500">
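
In case the image does not display, here is a minimal sketch of the two kinds of filter described above, using the small `data` data frame from this section. The names `blue_rows` and `large_rows` are just illustrative.

```python
import pandas as pd

# Rebuild the example data frame shown above (rows labelled 1 to 6).
data = pd.DataFrame(
    {
        "colour": ["red", "blue", "blue", "red", "blue", "blue"],
        "size": [15, 19, 20, 22, 12, 23],
        "speed": [12.3, 34.1, 23.2, 21.9, 33.6, 28.8],
    },
    index=range(1, 7),
)

# Keep only the rows where colour is "blue".
blue_rows = data[data["colour"] == "blue"]

# Keep only the rows where size is greater than 20.
large_rows = data[data["size"] > 20]

print(blue_rows)
print(large_rows)
```

The expression inside the brackets, such as `data["size"] > 20`, produces one `True` or `False` per row, and `[]` keeps the rows marked `True`.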

**Question 7.3.1** <br> {points: 1}

Use the `[]` operation to subset your data frame `marathon_small` so it only contains survey data from males. Assign your new filtered data frame to an object called `marathon_filtered_rows`.

In [None]:
# your code here
raise NotImplementedError
marathon_filtered_rows

In [None]:
from hashlib import sha1
assert sha1(str(type(marathon_filtered_rows.shape)).encode("utf-8")+b"ccfe0e29d319f929").hexdigest() == "7b1acb62efc3615dde2f5348970ca1d488afe636", "type of marathon_filtered_rows.shape is not tuple. marathon_filtered_rows.shape should be a tuple"
assert sha1(str(len(marathon_filtered_rows.shape)).encode("utf-8")+b"ccfe0e29d319f929").hexdigest() == "407a5f3d08feb98b7ccfa547baa5ff069aa3f95e", "length of marathon_filtered_rows.shape is not correct"
assert sha1(str(sorted(map(str, marathon_filtered_rows.shape))).encode("utf-8")+b"ccfe0e29d319f929").hexdigest() == "d2d10516859e96b45e92bb81077ad68341ab9b07", "values of marathon_filtered_rows.shape are not correct"
assert sha1(str(marathon_filtered_rows.shape).encode("utf-8")+b"ccfe0e29d319f929").hexdigest() == "74f90c301eded60f5da492e4bffcb09cb7e7ea08", "order of elements of marathon_filtered_rows.shape is not correct"

assert sha1(str(type(marathon_filtered_rows.columns.values)).encode("utf-8")+b"2f3446aebf8f5c43").hexdigest() == "efd67b5e45f1626067ce6a30599317177513c90c", "type of marathon_filtered_rows.columns.values is not correct"
assert sha1(str(marathon_filtered_rows.columns.values).encode("utf-8")+b"2f3446aebf8f5c43").hexdigest() == "83d6285fd3cc1d9301fe6c84dbbb876ca59a5e1b", "value of marathon_filtered_rows.columns.values is not correct"

assert sha1(str(type(sum(marathon_filtered_rows.bmi))).encode("utf-8")+b"e87f6e5a1c41aaf4").hexdigest() == "b80e7deb6dd045be7170931e580b886466c66d47", "type of sum(marathon_filtered_rows.bmi) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(marathon_filtered_rows.bmi), 2)).encode("utf-8")+b"e87f6e5a1c41aaf4").hexdigest() == "7636ed74b0e2fda6463f8ad2fbe032e33447258f", "value of sum(marathon_filtered_rows.bmi) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 7.3.2** <br> {points: 1}

The `[]` operation can also be used to subset columns via the syntax `data[['column1', 'column2']]`. Use the `[]` operation to subset your data frame `marathon_small` so it only contains the columns "bmi" and "km10_time_seconds". Assign your new filtered data frame to an object called `marathon_filtered_columns`.

In [None]:
# your code here
raise NotImplementedError
marathon_filtered_columns

In [None]:
from hashlib import sha1
assert sha1(str(type(marathon_filtered_columns.shape)).encode("utf-8")+b"885ff3263b7b102c").hexdigest() == "b5cc4cfcc9708c03d74daabdfc7199c66f68883a", "type of marathon_filtered_columns.shape is not tuple. marathon_filtered_columns.shape should be a tuple"
assert sha1(str(len(marathon_filtered_columns.shape)).encode("utf-8")+b"885ff3263b7b102c").hexdigest() == "c8749996a2013e1776cd445bcbc9120971edb9ae", "length of marathon_filtered_columns.shape is not correct"
assert sha1(str(sorted(map(str, marathon_filtered_columns.shape))).encode("utf-8")+b"885ff3263b7b102c").hexdigest() == "e446e84168ae47dac43734f3d8c9c56e44d478b3", "values of marathon_filtered_columns.shape are not correct"
assert sha1(str(marathon_filtered_columns.shape).encode("utf-8")+b"885ff3263b7b102c").hexdigest() == "36d3ef6afa6647512e54685a9c536cd4cd0a6882", "order of elements of marathon_filtered_columns.shape is not correct"

assert sha1(str(type(marathon_filtered_columns.columns.values)).encode("utf-8")+b"8bca4c14844dd0cb").hexdigest() == "0186e1a49efec1a01105f6d335bd27fe651db204", "type of marathon_filtered_columns.columns.values is not correct"
assert sha1(str(marathon_filtered_columns.columns.values).encode("utf-8")+b"8bca4c14844dd0cb").hexdigest() == "1fe973701c79090253e0f4ce0aa2143a6fff9737", "value of marathon_filtered_columns.columns.values is not correct"

assert sha1(str(type(sum(marathon_filtered_columns.bmi))).encode("utf-8")+b"b66e645315e45d96").hexdigest() == "1b52517e61df82f8e200a943aaa5b1f9e472806d", "type of sum(marathon_filtered_columns.bmi) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(marathon_filtered_columns.bmi), 2)).encode("utf-8")+b"b66e645315e45d96").hexdigest() == "707b648d56cd513591f7907fc5c4f3f95c9722fb", "value of sum(marathon_filtered_columns.bmi) is not correct (rounded to 2 decimal places)"

print('Success!')

### 7.4. Obtaining a subset of rows AND columns with `loc[]`

The `[]` operation is only used when you want to either filter rows **or** select columns;
it cannot be used to do both operations at the same time. This is where `loc[]`
comes in. When we use `loc` to select columns and rows by labels in a data frame, we always specify the row condition first, and then the list of columns we want: `data.loc[data['column1'] == row_condition, ['column1', 'column2']]`.
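
As a sketch of this pattern, the cell below applies `loc` to the small colour/size/speed data frame from section 7.3, which is rebuilt here so the cell runs on its own. The names `blue_size_speed` and `two_steps` are made up for illustration.

```python
import pandas as pd

# The example data frame from section 7.3.
data = pd.DataFrame({
    "colour": ["red", "blue", "blue", "red", "blue", "blue"],
    "size": [15, 19, 20, 22, 12, 23],
    "speed": [12.3, 34.1, 23.2, 21.9, 33.6, 28.8],
})

# Row condition first, then the list of columns to keep.
blue_size_speed = data.loc[data["colour"] == "blue", ["size", "speed"]]

# The same result using two separate [] steps:
# filter the rows, then select the columns.
two_steps = data[data["colour"] == "blue"][["size", "speed"]]

print(blue_size_speed)
```

Both approaches give the same data frame here; `loc` simply expresses it as one operation.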

**Question 7.4.1** <br> {points: 1}

Use `loc` to keep only the male runners and the columns `bmi` and `km10_time_seconds` from `marathon_small`, i.e. perform both the steps from the previous two questions in a single operation. Assign your new filtered data frame to an object called `marathon_male`. 

*Make sure you select `bmi` first and then `km10_time_seconds`*!

In [None]:
# your code here
raise NotImplementedError
marathon_male

In [None]:
from hashlib import sha1
assert sha1(str(type(marathon_male.shape)).encode("utf-8")+b"995b372cd6a9652e").hexdigest() == "ddcaa4ea162bc1c6856add1a7684206df48377b4", "type of marathon_male.shape is not tuple. marathon_male.shape should be a tuple"
assert sha1(str(len(marathon_male.shape)).encode("utf-8")+b"995b372cd6a9652e").hexdigest() == "6a9c481398c02c958ec9ad58480f464cb3228dd0", "length of marathon_male.shape is not correct"
assert sha1(str(sorted(map(str, marathon_male.shape))).encode("utf-8")+b"995b372cd6a9652e").hexdigest() == "6029c487cc4e821374ae8a9e066324cb79c8806a", "values of marathon_male.shape are not correct"
assert sha1(str(marathon_male.shape).encode("utf-8")+b"995b372cd6a9652e").hexdigest() == "c177b9a27918d7b97ce4c2a02c775c944d50b326", "order of elements of marathon_male.shape is not correct"

assert sha1(str(type(sum(marathon_male.bmi))).encode("utf-8")+b"fb47962904e3324b").hexdigest() == "c8a9bd81406eb8244d2dbeef532b6e570b64462a", "type of sum(marathon_male.bmi) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(marathon_male.bmi), 2)).encode("utf-8")+b"fb47962904e3324b").hexdigest() == "e022ff005b3a68415bc95635ded3d7a539736131", "value of sum(marathon_male.bmi) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(marathon_male.km10_time_seconds.dropna()))).encode("utf-8")+b"c2f26cbb762f44d9").hexdigest() == "3089c6acb1e136abd5f35f2180a64939504a684b", "type of sum(marathon_male.km10_time_seconds.dropna()) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(marathon_male.km10_time_seconds.dropna()), 2)).encode("utf-8")+b"c2f26cbb762f44d9").hexdigest() == "2423600689c9bee17dbee418376f1cd82b6f67b9", "value of sum(marathon_male.km10_time_seconds.dropna()) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 7.4.2** <br> {points: 1}

What are the units of the time taken to complete a run of 10 km? Assign your answer to an object called `answer7_4_2`. Write your answer in lower case. Place your answer between quotation marks.


*Hint: scroll up and look at the introduction to this exercise.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer7_4_2)).encode("utf-8")+b"d03d5ca0fa8c0296").hexdigest() == "ca7dc887975286b3c66f158821666ac1a1b3375a", "type of answer7_4_2 is not str. answer7_4_2 should be an str"
assert sha1(str(len(answer7_4_2)).encode("utf-8")+b"d03d5ca0fa8c0296").hexdigest() == "56e37365ae73aee65ef3833206676a3d93381a8a", "length of answer7_4_2 is not correct"
assert sha1(str(answer7_4_2.lower()).encode("utf-8")+b"d03d5ca0fa8c0296").hexdigest() == "031b261d8fe81a4f833e5e2fe80f9585af16c286", "value of answer7_4_2 is not correct"
assert sha1(str(answer7_4_2).encode("utf-8")+b"d03d5ca0fa8c0296").hexdigest() == "031b261d8fe81a4f833e5e2fe80f9585af16c286", "correct string value of answer7_4_2 but incorrect case of letters"

print('Success!')

**Question 7.4.3**
<br> {points: 1}

What are the units for time (e.g., seconds, minutes, hours) that we would like to use when plotting BMI against time taken to run 10 km? Assign your answer to an object called `answer7_4_3`. Write your answer in lower case. Place your answer between quotation marks.

*Hint: scroll up and look at the introduction to this exercise.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer7_4_3)).encode("utf-8")+b"2134eba6c7c6f7c4").hexdigest() == "9c38bd9d2ab481520b1802add29d9f0132d73cfe", "type of answer7_4_3 is not str. answer7_4_3 should be an str"
assert sha1(str(len(answer7_4_3)).encode("utf-8")+b"2134eba6c7c6f7c4").hexdigest() == "8eea4dad00dd232416584f1afc399f30c05c49c5", "length of answer7_4_3 is not correct"
assert sha1(str(answer7_4_3.lower()).encode("utf-8")+b"2134eba6c7c6f7c4").hexdigest() == "380f784a41f5a93be8ac54479631d4b94e9a9d64", "value of answer7_4_3 is not correct"
assert sha1(str(answer7_4_3).encode("utf-8")+b"2134eba6c7c6f7c4").hexdigest() == "380f784a41f5a93be8ac54479631d4b94e9a9d64", "correct string value of answer7_4_3 but incorrect case of letters"

print('Success!')

### 7.5. Assign

The method `assign` is used to add columns to a dataset, typically by making use of existing columns to compute a new column. 

<img src="images/ws1_mutate_gen_py.png">

In the example above, we are creating a new column named `new_column` that is equal to `old_column * 10` and saving the results to an object called `data_mutated`.
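
Written out as text, the same idea might look like the sketch below. The data frame and its values are hypothetical; only the names `old_column`, `new_column` and `data_mutated` come from the description above.

```python
import pandas as pd

# Hypothetical data frame with a single column named old_column.
data = pd.DataFrame({"old_column": [1, 2, 3]})

# assign returns a new data frame; `data` itself is left unchanged.
data_mutated = data.assign(new_column=data["old_column"] * 10)

print(data_mutated)
```

Because `assign` does not modify the original data frame, the result needs to be saved to an object if you want to keep it.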

**Question 7.5.1**<br> {points: 1}

Add a new column to our `marathon_male` dataset called `km10_time_minutes` that is equal to `km10_time_seconds / 60`. Assign your answer to an object called `marathon_minutes`.

In [None]:
# your code here
raise NotImplementedError
marathon_minutes

In [None]:
from hashlib import sha1
assert sha1(str(type(marathon_minutes.shape)).encode("utf-8")+b"f33f51462f3e9b3d").hexdigest() == "226430568940aa5712765309c164aebac4725224", "type of marathon_minutes.shape is not tuple. marathon_minutes.shape should be a tuple"
assert sha1(str(len(marathon_minutes.shape)).encode("utf-8")+b"f33f51462f3e9b3d").hexdigest() == "961e32ddef48c185c730e347ff99595854be30b7", "length of marathon_minutes.shape is not correct"
assert sha1(str(sorted(map(str, marathon_minutes.shape))).encode("utf-8")+b"f33f51462f3e9b3d").hexdigest() == "ec65178fb3d7e0378946ef77142f82e001841c74", "values of marathon_minutes.shape are not correct"
assert sha1(str(marathon_minutes.shape).encode("utf-8")+b"f33f51462f3e9b3d").hexdigest() == "44423375bcc4f3721665025f113ae3ac42efe791", "order of elements of marathon_minutes.shape is not correct"

assert sha1(str(type(sum(marathon_minutes.km10_time_seconds.dropna()))).encode("utf-8")+b"9a3ad9c2ab6a7f72").hexdigest() == "77a461477e41248f1fbf0aa19e547a824d2a17a5", "type of sum(marathon_minutes.km10_time_seconds.dropna()) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(marathon_minutes.km10_time_seconds.dropna()), 2)).encode("utf-8")+b"9a3ad9c2ab6a7f72").hexdigest() == "a1401fed307a4e260a6fdb801cebb9919bb35114", "value of sum(marathon_minutes.km10_time_seconds.dropna()) is not correct (rounded to 2 decimal places)"

print('Success!')

### 7.6. Visualization
`Altair` is a powerful visualization package for Python. The fundamental object in `Altair` is the `Chart`, which takes a data frame as a single argument: `alt.Chart(dataframe)`. With a chart object in hand, we can specify how we would like the data to be visualized. We first indicate what kind of graphical mark we want to use to represent the data, by setting the mark attribute of the chart object with one of the `Chart.mark_*` methods. The `encode` method then builds a mapping between visual encoding channels (such as x, y, color, shape, size, etc.) and columns in the dataset.

![ws1_ggplot_male_py.png](images/ws1_ggplot_male_py.png)

Let's plot a scatterplot with the `bmi` on the x axis and `km10_time_minutes` on the y axis.

Before we start plotting using `Altair`, we need to import the package. You'll see we give it the alias `alt`.

In [None]:
import altair as alt

In [None]:
# Run this cell to create a scatterplot of BMI against the time it took to run 10 km.

plot = alt.Chart(marathon_minutes).mark_point().encode(
    x="bmi",
    y="km10_time_minutes"
)
plot

**Question 7.6.1** Multiple Choice
<br> {points: 1}

Looking at the graph above, choose the statement below that best reflects what we see.

A. There appears to be no relationship between 10 km run time and body mass index. As the value for body mass index increases we see neither an increase nor decrease in the time it takes to run 10 km.

B. There may be a positive relationship between 10 km run time and body mass index. As the value for body mass index increases, so does the time it takes to run 10 km.

C. There may be a negative relationship between 10 km run time and body mass index. As the value for body mass index increases, the time it takes to run 10 km decreases.




*Assign your answer to an object called `answer7_6_1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer7_6_1)).encode("utf-8")+b"9d289d4e70041648").hexdigest() == "d9e81fed8196db2066dff398e0ef18ad530fdc78", "type of answer7_6_1 is not str. answer7_6_1 should be an str"
assert sha1(str(len(answer7_6_1)).encode("utf-8")+b"9d289d4e70041648").hexdigest() == "0cc9f1d75dd257df7484035e46364a36f4274609", "length of answer7_6_1 is not correct"
assert sha1(str(answer7_6_1.lower()).encode("utf-8")+b"9d289d4e70041648").hexdigest() == "3180b5fba6c0e9f0047f32ceb637feb18e618f27", "value of answer7_6_1 is not correct"
assert sha1(str(answer7_6_1).encode("utf-8")+b"9d289d4e70041648").hexdigest() == "132c7966bf66bcaad443ffe2358d6d6a2119f9c2", "correct string value of answer7_6_1 but incorrect case of letters"

print('Success!')

The visualization code above barely scratches the surface of what `Altair`, and Python as a whole, are capable of. Not only are there far more choices about the kinds of plots available, but there are many, many options for customizing the look and feel of each graph. You can choose the font, the font size, the colors, the style of the axes, etc. 

Let’s dig a little deeper into just a couple of options that you can add to any of your graphs to make them look a little better. For example, you can change the text of the x-axis or y-axis label by specifying a title on `alt.X()` or `alt.Y()` inside the `encode` call. You can also change the font size using the `configure_axis` method. Let’s do that for the scatterplot to make the labels easier to read.

In [None]:
# Run this cell.
# You can replace the axis titles with whatever labels you wish.
# After running the cell once, try changing the titles to something else.

marathon_plot = alt.Chart(marathon_minutes).mark_point().encode(
    x=alt.X("bmi").title("Body Mass Index"),
    y=alt.Y("km10_time_minutes").title("10 km run time (minutes)"),
).configure_axis(
    labelFontSize=12,
    titleFontSize=12,
)
marathon_plot

## Attributions
- UC Berkeley [Data 8 Public Materials](https://github.com/data-8/data8assets)
- UBC [Key Capabilities in Data Science Programming in Python course](https://github.com/UBC-MDS/prog-python-data-science-students)