# Worksheet 6 - Classification

### Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

* Recognize situations where a simple classifier would be appropriate for making predictions.
* Explain the $K$-nearest neighbour classification algorithm.
* Interpret the output of a classifier.
* Compute, by hand, the distance between points when there are two explanatory variables/predictors.
* Describe what a training data set is and how it is used in classification.
* Given a dataset with two explanatory variables/predictors, use $K$-nearest neighbour classification in Python using the `scikit-learn` framework to predict the class of a single new observation.

This worksheet covers parts of [Chapter 5](https://python.datasciencebook.ca/classification1.html) of the online textbook. You should read this chapter before attempting this assignment. Wherever you see `___`, fill in the function, variable, or data needed to complete the code. Replace `raise NotImplementedError` with your completed code and answers, then run the cell.

In [None]:
### Run this cell before continuing
import random

import altair as alt
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import set_config

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

**Question 0.1** Multiple Choice: 
<br> {points: 1}

**Which of the following statements is NOT true of a training data set (in the context of classification)?**

A. A training data set is a collection of observations for which we know the true classes.

B. We can use a training set to explore and build our classifier.

C. The training data set is the underlying collection of observations for which we don't know the true classes.

*Assign your answer to an object called `answer0_1`. Make sure the correct answer is an uppercase letter. Remember to surround your answer with quotation marks (e.g. "D").*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer0_1)).encode("utf-8")+b"0922a816cf196633").hexdigest() == "7a7186fe1cf75dcd6de3adb41670f0725ba5325c", "type of answer0_1 is not str. answer0_1 should be an str"
assert sha1(str(len(answer0_1)).encode("utf-8")+b"0922a816cf196633").hexdigest() == "42d268300fab3fd91faca73141d94be12d990d4d", "length of answer0_1 is not correct"
assert sha1(str(answer0_1.lower()).encode("utf-8")+b"0922a816cf196633").hexdigest() == "c4290b936d27287c66cb04a4cfd5b270eea7be66", "value of answer0_1 is not correct"
assert sha1(str(answer0_1).encode("utf-8")+b"0922a816cf196633").hexdigest() == "c577805e1b31a5a67ef1b3f2e13377d70511d06c", "correct string value of answer0_1 but incorrect case of letters"

print('Success!')

**Question 0.2** Multiple Choice
<br> {points: 1}

(Adapted from James et al, "[An introduction to statistical learning](http://www-bcf.usc.edu/~gareth/ISL/)" (page 53))

Consider the scenario below: 

We collect data on 20 similar products. For each product we have recorded whether it was a success or failure (labelled as such by the Sales team), price charged for the product, marketing budget, competition price, customer data, and ten other variables. 

**Which of the following is a classification problem?**

A. We are interested in comparing the profit margins for products that are a success and products that are a failure. 

B. We are considering launching a new product and wish to know whether it will be a success or a failure. 

C. We wish to group customers based on their preferences and use that knowledge to develop targeted marketing programs. 

*Assign your answer to an object called `answer0_2`. Make sure the correct answer is an uppercase letter. Remember to surround your answer with quotation marks (e.g. "F").*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer0_2)).encode("utf-8")+b"294ef113fd83ab9e").hexdigest() == "d52fa700a5d595bb563e0e6d8dbeded4fc435d6d", "type of answer0_2 is not str. answer0_2 should be an str"
assert sha1(str(len(answer0_2)).encode("utf-8")+b"294ef113fd83ab9e").hexdigest() == "224cc912b59c462ab1a230069ee3b137f5845127", "length of answer0_2 is not correct"
assert sha1(str(answer0_2.lower()).encode("utf-8")+b"294ef113fd83ab9e").hexdigest() == "162e498efe9cfe3cc9ab8195f434c10ea5c548bb", "value of answer0_2 is not correct"
assert sha1(str(answer0_2).encode("utf-8")+b"294ef113fd83ab9e").hexdigest() == "ec5fbe2c6dd6097cfa08dd5a3dd4a4907ca8f4b4", "correct string value of answer0_2 but incorrect case of letters"

print('Success!')

## 1. Breast Cancer Data Set 
We will work with the breast cancer data from this week's pre-reading. 
> Note that the breast cancer data in this worksheet have been **standardized (centred and scaled)** for you already. We will implement these steps ourselves in later worksheets and tutorials, but for now, just know that the data have been standardized. The variables are therefore unitless, which is why variables like Radius take zero and negative values.

**Question 1.0**
<br> {points: 1}

Read the `clean-wdbc-data.csv` file (found in the `data` directory) using the `pd.read_csv` function into the notebook and store it as a data frame. *Name it `cancer`.*

In [None]:
# your code here
raise NotImplementedError
cancer

In [None]:
from hashlib import sha1
assert sha1(str(type(cancer is None)).encode("utf-8")+b"3345d53c2493d8bc").hexdigest() == "7dead76496fa74b0076e4e8ba5e4299c332a17b5", "type of cancer is None is not bool. cancer is None should be a bool"
assert sha1(str(cancer is None).encode("utf-8")+b"3345d53c2493d8bc").hexdigest() == "954aab458b59c3d2406aa623b7cbd1bae0a2cf60", "boolean value of cancer is None is not correct"

assert sha1(str(type(cancer)).encode("utf-8")+b"b35fba51e2036ef5").hexdigest() == "5adf2aa8a4f6cd0ba0a907d031eb893b6556d162", "type of type(cancer) is not correct"

assert sha1(str(type(cancer.shape)).encode("utf-8")+b"5df59530f1b8c114").hexdigest() == "8ca5d074e209d153ba1ade68e78ff347e2b4eb2b", "type of cancer.shape is not tuple. cancer.shape should be a tuple"
assert sha1(str(len(cancer.shape)).encode("utf-8")+b"5df59530f1b8c114").hexdigest() == "dc0eb374a2d5de61cbbc94bffcaab501c51d085d", "length of cancer.shape is not correct"
assert sha1(str(sorted(map(str, cancer.shape))).encode("utf-8")+b"5df59530f1b8c114").hexdigest() == "edc2b0d5ab522caefbb7e156b0a55dc0e7c003fb", "values of cancer.shape are not correct"
assert sha1(str(cancer.shape).encode("utf-8")+b"5df59530f1b8c114").hexdigest() == "a59235a621dd8907bc1aeb444d503eac2b149688", "order of elements of cancer.shape is not correct"

assert sha1(str(type(sum(cancer.Area))).encode("utf-8")+b"faff98a990582f3c").hexdigest() == "119fc937bc88ef6076993c5a3577af4c17bb5130", "type of sum(cancer.Area) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(cancer.Area), 2)).encode("utf-8")+b"faff98a990582f3c").hexdigest() == "aa7bd09472416b4719e3e4337e435ac4f59dd537", "value of sum(cancer.Area) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(cancer.columns.values)).encode("utf-8")+b"48d6153fe52449fc").hexdigest() == "a40bb25cc12434cce5b768812d81e5609076abec", "type of cancer.columns.values is not correct"
assert sha1(str(cancer.columns.values).encode("utf-8")+b"48d6153fe52449fc").hexdigest() == "975fcd0adee412d8a29e7f91cb61ec9924d89b3d", "value of cancer.columns.values is not correct"

assert sha1(str(type(cancer['Class'].dtype)).encode("utf-8")+b"44a075b8fc578cf6").hexdigest() == "29ebae615e5458643c89b7cc233f044e060a432c", "type of cancer['Class'].dtype is not correct"
assert sha1(str(cancer['Class'].dtype).encode("utf-8")+b"44a075b8fc578cf6").hexdigest() == "26c2bc6c5569cfd57d41ef97d72b1b0272fa940e", "value of cancer['Class'].dtype is not correct"

print('Success!')

**Question 1.1** True or False: 
<br> {points: 1}

After looking at the first six rows of the `cancer` data frame, suppose we asked you to predict the variable `Area` for a new observation. **Is this a classification problem?**

*Assign your answer to an object called `answer1_1`. Make sure the correct answer is a boolean. i.e. `True` or `False`.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_1)).encode("utf-8")+b"47e792ad3f733f74").hexdigest() == "b1b94a968686b1a6d222e3777769966cee244dfd", "type of answer1_1 is not bool. answer1_1 should be a bool"
assert sha1(str(answer1_1).encode("utf-8")+b"47e792ad3f733f74").hexdigest() == "a3b9cbd6c938b4a3b2ea589cba5285d780f03680", "boolean value of answer1_1 is not correct"

print('Success!')

**Question 1.2** 
<br> {points: 1}

Create a scatterplot of the data with `Symmetry` on the x-axis and `Radius` on the y-axis. Colour the points by `Class`. As you create this plot, ensure you follow the guidelines for creating effective visualizations. In particular, note in the chart axis titles whether the data is standardized or not, and add a suitable opacity level to the graphical mark. You should also replace the values in the dataframe's `Class` column from `'M'` to `'Malignant'` and from `'B'` to `'Benign'`.

*Assign your plot to an object called `cancer_plot`.*
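
As an illustration of relabelling values in a column, here is a minimal sketch on a made-up data frame. The `toy` data frame, its `Size` column, and the labels are hypothetical and are not part of the worksheet data.

```python
import pandas as pd

# Hypothetical toy column with single-letter codes (not the worksheet data)
toy = pd.DataFrame({"Size": ["S", "L", "S", "L"]})

# Map each code to a descriptive label; values missing from the dictionary stay unchanged
toy["Size"] = toy["Size"].replace({"S": "Small", "L": "Large"})
print(toy["Size"].tolist())
```

Descriptive labels such as these appear directly in a chart legend, which is usually easier to read than single-letter codes.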

In [None]:
# cancer["Class"] = cancer["Class"].___({
#     ___,
#     ___
# })
#
# cancer_plot = ___

# your code here
raise NotImplementedError
cancer_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(cancer_plot is None)).encode("utf-8")+b"c68156e9666a9559").hexdigest() == "1d718ed8dac5ad841fc42c61b440f2ed4943f2e0", "type of cancer_plot is None is not bool. cancer_plot is None should be a bool"
assert sha1(str(cancer_plot is None).encode("utf-8")+b"c68156e9666a9559").hexdigest() == "50a6b81678e23768767853679282102a2165ae61", "boolean value of cancer_plot is None is not correct"

assert sha1(str(type(cancer_plot.encoding.x['shorthand'])).encode("utf-8")+b"b723dc42d985afb5").hexdigest() == "10dfe32c1229e6af78febea416eddb6ee80da9e9", "type of cancer_plot.encoding.x['shorthand'] is not str. cancer_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(cancer_plot.encoding.x['shorthand'])).encode("utf-8")+b"b723dc42d985afb5").hexdigest() == "2b656fb988a8885e056107729ac5da56961a6d25", "length of cancer_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(cancer_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"b723dc42d985afb5").hexdigest() == "9ad4c536c4bff0e1d8810134224b2087f47c927a", "value of cancer_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(cancer_plot.encoding.x['shorthand']).encode("utf-8")+b"b723dc42d985afb5").hexdigest() == "41677e7c0bf534f46fd18eec1977c1285f94fc0d", "correct string value of cancer_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(cancer_plot.encoding.y['shorthand'])).encode("utf-8")+b"13b66f7dd2b3128c").hexdigest() == "4e5506a4b09b38af366c6e6eaf51c402487da08e", "type of cancer_plot.encoding.y['shorthand'] is not str. cancer_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(cancer_plot.encoding.y['shorthand'])).encode("utf-8")+b"13b66f7dd2b3128c").hexdigest() == "ecdbb2c253f76d9ded3e9fb606cfaff8654b9b3c", "length of cancer_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(cancer_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"13b66f7dd2b3128c").hexdigest() == "df57a9fe4c11f95e910c966a0f089724910316df", "value of cancer_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(cancer_plot.encoding.y['shorthand']).encode("utf-8")+b"13b66f7dd2b3128c").hexdigest() == "34fd7b476dffb451a664f2b81b861f5e8a81a81b", "correct string value of cancer_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(cancer_plot.encoding.color['shorthand'])).encode("utf-8")+b"0add9a3b9dd8f509").hexdigest() == "ee35e5bbd8a16f85ba06b817338d0f5eb675d550", "type of cancer_plot.encoding.color['shorthand'] is not str. cancer_plot.encoding.color['shorthand'] should be an str"
assert sha1(str(len(cancer_plot.encoding.color['shorthand'])).encode("utf-8")+b"0add9a3b9dd8f509").hexdigest() == "66d8b48ad8bca0ad669012143480bcffd10a4dc5", "length of cancer_plot.encoding.color['shorthand'] is not correct"
assert sha1(str(cancer_plot.encoding.color['shorthand'].lower()).encode("utf-8")+b"0add9a3b9dd8f509").hexdigest() == "ae39e5cdbcf17509ad33debfa30c4326b5534bce", "value of cancer_plot.encoding.color['shorthand'] is not correct"
assert sha1(str(cancer_plot.encoding.color['shorthand']).encode("utf-8")+b"0add9a3b9dd8f509").hexdigest() == "1599e9f368d40a9346cc54a5bcf866e09ab7e39f", "correct string value of cancer_plot.encoding.color['shorthand'] but incorrect case of letters"

assert sha1(str(type(cancer_plot.mark)).encode("utf-8")+b"10b881dc82da953f").hexdigest() == "e870e4f4db2d928c1b66550a160ce69b0ffc1d5e", "type of cancer_plot.mark is not correct"
assert sha1(str(cancer_plot.mark).encode("utf-8")+b"10b881dc82da953f").hexdigest() == "4b15ffbbfc6ccc1a282ccb26ca81c64f4ad01e86", "value of cancer_plot.mark is not correct"

assert sha1(str(type(isinstance(cancer_plot.encoding.color['title'], str))).encode("utf-8")+b"b55706e808295ddd").hexdigest() == "1ef10d38a7189b3857ffc54ac10354434ef76ebe", "type of isinstance(cancer_plot.encoding.color['title'], str) is not bool. isinstance(cancer_plot.encoding.color['title'], str) should be a bool"
assert sha1(str(isinstance(cancer_plot.encoding.color['title'], str)).encode("utf-8")+b"b55706e808295ddd").hexdigest() == "685c0e45cc17fba743f5530fa1e72bba07ff8012", "boolean value of isinstance(cancer_plot.encoding.color['title'], str) is not correct"

assert sha1(str(type(isinstance(cancer_plot.encoding.x['title'], str))).encode("utf-8")+b"bab9a40ac336ce11").hexdigest() == "bfc0da5f5108797d6555f499d61ac3382f8ccd69", "type of isinstance(cancer_plot.encoding.x['title'], str) is not bool. isinstance(cancer_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(cancer_plot.encoding.x['title'], str)).encode("utf-8")+b"bab9a40ac336ce11").hexdigest() == "4a6c7a4869a4ada5504531acef499e985700063f", "boolean value of isinstance(cancer_plot.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(cancer_plot.encoding.y['title'], str))).encode("utf-8")+b"79199b655fcc004b").hexdigest() == "9cf0401f8e1ef3d02937261c3801e1da9b946b7d", "type of isinstance(cancer_plot.encoding.y['title'], str) is not bool. isinstance(cancer_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(cancer_plot.encoding.y['title'], str)).encode("utf-8")+b"79199b655fcc004b").hexdigest() == "8bdc1b32ddbd41484a5b8c70c7aff6284b4d5290", "boolean value of isinstance(cancer_plot.encoding.y['title'], str) is not correct"

print('Success!')

**Question 1.3** 
<br> {points: 1}

Just by looking at the scatterplot above, how would you classify an observation with `Symmetry` = 1 and `Radius` = 1 (Benign or Malignant)?

*Assign your answer to an object called `answer1_3`. Make sure the correct answer is written fully. Remember to surround your answer with quotation marks (e.g. "Benign" / "Malignant").*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_3)).encode("utf-8")+b"31d6c4aa0e0193c0").hexdigest() == "34abf93d3b371e23ad1b6c05ffff99053e29a65d", "type of answer1_3 is not str. answer1_3 should be an str"
assert sha1(str(len(answer1_3)).encode("utf-8")+b"31d6c4aa0e0193c0").hexdigest() == "21a94768e54ec4c3b846ec9145448cdba35bbc54", "length of answer1_3 is not correct"
assert sha1(str(answer1_3.lower()).encode("utf-8")+b"31d6c4aa0e0193c0").hexdigest() == "2f262c00544da49ab5bce6f264c73b05707a11f1", "value of answer1_3 is not correct"
assert sha1(str(answer1_3).encode("utf-8")+b"31d6c4aa0e0193c0").hexdigest() == "33a9af6268f7284cfe1f9710cd978ecba2c8faa0", "correct string value of answer1_3 but incorrect case of letters"

print('Success!')

We will now compute the distance between the first and second observation in the breast cancer dataset using the explanatory variables/predictors `Symmetry` and `Radius`. Recall we can calculate the distance between two points using the following formula:

$$Distance = \sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}$$
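
As a quick check of the formula, here is a minimal sketch using two made-up points, $a = (1, 2)$ and $b = (4, 6)$. These values and variable names are illustrative assumptions and do not come from the breast cancer data.

```python
# Hypothetical points, chosen so the arithmetic is easy to verify by hand
a_x, a_y = 1.0, 2.0
b_x, b_y = 4.0, 6.0

# Square each coordinate difference, add the squares, then take the square root
toy_distance = ((a_x - b_x) ** 2 + (a_y - b_y) ** 2) ** 0.5
print(toy_distance)  # 5.0, because sqrt(9 + 16) = sqrt(25)
```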

**Question 1.4** 
<br> {points: 1}

First, extract the coordinates for the two observations and assign them to objects called: 

- `ax` (Symmetry value for the first row)
- `ay` (Radius value for the first row)
- `bx` (Symmetry value for the second row)
- `by` (Radius value for the second row).

*Scaffolding for `ax` is given*.
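
For reference, the sketch below shows how `.loc[row_label, column_name]` pulls a single value out of a data frame. The `toy` data frame and its columns are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data frame (not the breast cancer data)
toy = pd.DataFrame({"Height": [1.5, 1.8, 1.6], "Weight": [50.0, 80.0, 62.0]})

# Row label 1 is the second row, because the default index starts at 0
second_height = toy.loc[1, "Height"]
print(second_height)
```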

In [None]:
# ax = cancer.loc[0, "Symmetry"] # selecting the first observation from cancer and pulling the value from the Symmetry column as a numeric value only

# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(ax)).encode("utf-8")+b"350984883bf7eb25").hexdigest() == "77d2f5bba8c94c0a281e8d145d2a48f493aa53f2", "type of ax is not correct"
assert sha1(str(ax).encode("utf-8")+b"350984883bf7eb25").hexdigest() == "f6bddf3e5eadb5688750a34cd179a9e94af8dbd5", "value of ax is not correct"

assert sha1(str(type(ay)).encode("utf-8")+b"645cae1877af771e").hexdigest() == "a8c66ea291a84661dba1d650d5ff267ed773a3da", "type of ay is not correct"
assert sha1(str(ay).encode("utf-8")+b"645cae1877af771e").hexdigest() == "d39f3dbc9c4feb23a8d863c0b8bd28bbc8f02d06", "value of ay is not correct"

assert sha1(str(type(bx)).encode("utf-8")+b"dd07c863f264ab5c").hexdigest() == "d0c91f0d6049262650dbf97597508ab9572cc0c9", "type of bx is not correct"
assert sha1(str(bx).encode("utf-8")+b"dd07c863f264ab5c").hexdigest() == "0e1b41dc0059fbd99fcf22de944fb325a5e4b099", "value of bx is not correct"

assert sha1(str(type(by)).encode("utf-8")+b"31ae037efd3908cd").hexdigest() == "680f71b965a755f3760e6942416c07f95779fccc", "type of by is not correct"
assert sha1(str(by).encode("utf-8")+b"31ae037efd3908cd").hexdigest() == "a1e2a6df19eb1970993763e9daeb868ddc142aac", "value of by is not correct"

print('Success!')

**Question 1.5**
<br> {points: 1}

Plug the coordinates into the distance equation. Hint: `**` is the exponent symbol in Python.

*Assign your answer to an object called `answer1_5`.*

Fill in the `___` in the cell below. 

In [None]:
# ___ = ((___ - ___)**___ + (___ - ___)**___)**___

# your code here
raise NotImplementedError
answer1_5

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_5)).encode("utf-8")+b"dfe4868569a4a064").hexdigest() == "1980f143bbcb009f6f38283e491495712d4b3896", "type of answer1_5 is not correct"
assert sha1(str(answer1_5).encode("utf-8")+b"dfe4868569a4a064").hexdigest() == "9993c9ddcf879707d33ddb3631f6d0ffddc07bc3", "value of answer1_5 is not correct"

print('Success!')

**Question 1.6**
<br> {points: 1}

Now we'll do the same thing *with 3 explanatory variables/predictors*: `Symmetry`, `Radius` and `Concavity`. Again, use the first two rows in the data set as the points you are calculating the distance between (point $a$ is row 0, and point $b$ is row 1).


*Find the coordinates for the third variable (Concavity) and assign them to objects called `az` and `bz`. Use the scaffolding given in **Question 1.4** as a guide.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(az)).encode("utf-8")+b"0bc3057942d91fa8").hexdigest() == "8fcc9383ab6e619697382a63e13c1e423cd2a24c", "type of az is not correct"
assert sha1(str(az).encode("utf-8")+b"0bc3057942d91fa8").hexdigest() == "73163dd5890cab75bc681517e063e726dc98b04e", "value of az is not correct"

assert sha1(str(type(bz)).encode("utf-8")+b"ec6229535ca2e62b").hexdigest() == "4ed997edf32c603dd4d1eca97a26a113d1039b7f", "type of bz is not correct"
assert sha1(str(bz).encode("utf-8")+b"ec6229535ca2e62b").hexdigest() == "95953c80f3d83d0c36c07f6d78a11c3bbd6e9f7b", "value of bz is not correct"

print('Success!')

**Question 1.7**
<br> {points: 1}

Again, calculate the distance between the first and second observation in the breast cancer dataset using 3 explanatory variables/predictors: `Symmetry`, `Radius` and `Concavity`.

*Assign your answer to an object called `answer1_7`. Use the scaffolding given to calculate `answer1_5` as a guide.*
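
With a third predictor, the formula gains one more squared difference under the square root. Writing $a_z$ and $b_z$ for the third coordinate of each point (here, Concavity), the two-variable formula extends to:

$$Distance = \sqrt{(a_x - b_x)^2 + (a_y - b_y)^2 + (a_z - b_z)^2}$$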

In [None]:
# your code here
raise NotImplementedError
answer1_7

In [None]:
from hashlib import sha1
assert sha1(str(type(round(answer1_7, 2))).encode("utf-8")+b"ecddd317dba28bca").hexdigest() == "a2036c7fa281c9be6bca9f9a1c6012c79839c0d6", "type of round(answer1_7, 2) is not correct"
assert sha1(str(round(answer1_7, 2)).encode("utf-8")+b"ecddd317dba28bca").hexdigest() == "e134a25a5a4a16afaf8cf82b7fa94697edfa0def", "value of round(answer1_7, 2) is not correct"

print('Success!')

**Question 1.8**
<br> {points: 1}

Let's do this without explicitly making coordinate variables!

Create a pandas series of the coordinates for each point. Name one series `point_a` and the other series `point_b`. Within the series, the order of coordinates should be: `Symmetry`, `Radius`, `Concavity`.

Fill in the `___` in the cell below.
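
The sketch below shows how selecting one row and a list of columns with `.loc` returns a pandas series, with the values in the same order as the list. The `toy` data frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data frame (not the breast cancer data)
toy = pd.DataFrame(
    {"Height": [1.5, 1.8], "Weight": [50.0, 80.0], "Age": [30.0, 41.0]}
)

# Passing a list of column names returns a Series whose index follows the list order
toy_point = toy.loc[0, ["Age", "Height"]]
print(toy_point)
```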


In [None]:
# This is only the scaffolding for one point (you need to make another one for row number 1)

# ___ =  cancer.loc[0, [___, "Radius", ___]]


# your code here
raise NotImplementedError
print(point_a)
print(point_b)

In [None]:
from hashlib import sha1
assert sha1(str(type(round(sum(point_a), 2))).encode("utf-8")+b"a59291c63648b5d5").hexdigest() == "73048805bb31ef97f846eb2f0ebc5b934a218519", "type of round(sum(point_a), 2) is not correct"
assert sha1(str(round(sum(point_a), 2)).encode("utf-8")+b"a59291c63648b5d5").hexdigest() == "b7906f5275e57575a86b8a6ecbcc3f52419205dc", "value of round(sum(point_a), 2) is not correct"

assert sha1(str(type(round(sum(point_b), 2))).encode("utf-8")+b"18f2383163960665").hexdigest() == "feb551333906ed4bad1672fcb389950ff6ba1a81", "type of round(sum(point_b), 2) is not correct"
assert sha1(str(round(sum(point_b), 2)).encode("utf-8")+b"18f2383163960665").hexdigest() == "1dcef47e5659027e457cfac45ed84fb45ebb17f4", "value of round(sum(point_b), 2) is not correct"

print('Success!')

**Question 1.9**
<br> {points: 1}

Compute the squared differences between the two series, `point_a` and `point_b`. The result should be a series of length 3 named `dif_square`. *Hint: `**` is the exponent symbol in Python.*
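
Arithmetic on pandas series is applied element by element, matching values by their index labels. A minimal sketch with two made-up series (`p` and `q` are hypothetical names, not worksheet objects):

```python
import pandas as pd

# Two hypothetical series that share the same index labels
p = pd.Series([1.0, 2.0, 3.0], index=["u", "v", "w"])
q = pd.Series([4.0, 6.0, 3.0], index=["u", "v", "w"])

# Subtraction and exponentiation both work element-wise
toy_sq = (p - q) ** 2
print(toy_sq)  # u: 9.0, v: 16.0, w: 0.0
```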

In [None]:
# your code here
raise NotImplementedError
dif_square

In [None]:
from hashlib import sha1
assert sha1(str(type(sum(dif_square))).encode("utf-8")+b"43547ef720487c14").hexdigest() == "e59bc16499d28470c94a4fe8f569968a7c3a4380", "type of sum(dif_square) is not correct"
assert sha1(str(sum(dif_square)).encode("utf-8")+b"43547ef720487c14").hexdigest() == "f3845ec909a9f4a0270da3cfbcd36139f02da152", "value of sum(dif_square) is not correct"

print('Success!')

**Question 1.09.1**
<br> {points: 1}

Sum the squared differences between the two points, `point_a` and `point_b`. The result should be a single number named `dif_sum`. 

*Hint: the `sum` method of a pandas series returns the sum of its elements.*

In [None]:
# your code here
raise NotImplementedError
dif_sum

In [None]:
from hashlib import sha1
assert sha1(str(type(dif_sum)).encode("utf-8")+b"a2f00a4de3d3eaf6").hexdigest() == "010668de05039ce5a12de5157d62ed17e46b86b4", "type of dif_sum is not correct"
assert sha1(str(dif_sum).encode("utf-8")+b"a2f00a4de3d3eaf6").hexdigest() == "42c96c45a7682f183f0a27a24379af4961408537", "value of dif_sum is not correct"

print('Success!')

**Question 1.09.2**
<br> {points: 1}

Take the square root of the sum of your squared differences. The result should be a number named `root_dif_sum`. 

In [None]:
# your code here
raise NotImplementedError
root_dif_sum

In [None]:
from hashlib import sha1
assert sha1(str(type(root_dif_sum)).encode("utf-8")+b"464507cc52d4cc88").hexdigest() == "61a4238e63e64dfc2c41efbe53127e4d59654c6a", "type of root_dif_sum is not correct"
assert sha1(str(root_dif_sum).encode("utf-8")+b"464507cc52d4cc88").hexdigest() == "5a56b33176c0c1573ea9ad2e49c4e1dc07cf87f7", "value of root_dif_sum is not correct"

print('Success!')

**Question 1.09.3**
<br> {points: 1}

If we have more than a few points, calculating distances as we did in the previous questions is VERY tedious. Let's use the `euclidean_distances` function from the `scikit-learn` package to find the distance between the first and second observation in the breast cancer dataset using Symmetry, Radius and Concavity. 

Fill in the `___` in the cell below. 

*Assign your answer to an object called `dist_cancer_two_rows`. Note that the `euclidean_distances` function will return a 2-by-2 array of four values, one for each pairing of the two points (including each point paired with itself); we will see another example of this and describe it in more detail later in this worksheet.*
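
To see the shape of that output on a small case, here is a sketch using two made-up points, $(1, 2)$ and $(4, 6)$. The `toy_points` data frame and its column names are hypothetical.

```python
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# Two hypothetical points, one per row (not the breast cancer data)
toy_points = pd.DataFrame({"u": [1.0, 4.0], "v": [2.0, 6.0]})

# With a single input, distances are computed between every pair of rows
toy_dists = euclidean_distances(toy_points)
print(toy_dists)

# Entry [i, j] is the distance from row i to row j, so the diagonal
# (each point to itself) is 0 and the two off-diagonal entries are both 5
```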

In [None]:
# dist_cancer_two_rows = euclidean_distances(
#     cancer.loc[0:1, [___, "Radius", ___]]
# )

# your code here
raise NotImplementedError
dist_cancer_two_rows

In [None]:
from hashlib import sha1
assert sha1(str(type(dist_cancer_two_rows[0, 0])).encode("utf-8")+b"0f6be0e8648b9928").hexdigest() == "33287f775f6884d72f32046ca74ff6b0df65d03f", "type of dist_cancer_two_rows[0, 0] is not correct"
assert sha1(str(dist_cancer_two_rows[0, 0]).encode("utf-8")+b"0f6be0e8648b9928").hexdigest() == "afa1938357948c19f60a1211d6db6268d92673a1", "value of dist_cancer_two_rows[0, 0] is not correct"

assert sha1(str(type(dist_cancer_two_rows[0, 1])).encode("utf-8")+b"546887dd6e0d4a21").hexdigest() == "3933daf9b5940b42d2952804e28e267bc0b53b93", "type of dist_cancer_two_rows[0, 1] is not correct"
assert sha1(str(dist_cancer_two_rows[0, 1]).encode("utf-8")+b"546887dd6e0d4a21").hexdigest() == "e7e1be552c69c541fcd9e24365397b255f535bf3", "value of dist_cancer_two_rows[0, 1] is not correct"

assert sha1(str(type(dist_cancer_two_rows[1, 0])).encode("utf-8")+b"6f74c27b149d0d0d").hexdigest() == "6f14a8231431802784d66a8cba1087db70d2d5ef", "type of dist_cancer_two_rows[1, 0] is not correct"
assert sha1(str(dist_cancer_two_rows[1, 0]).encode("utf-8")+b"6f74c27b149d0d0d").hexdigest() == "bc96591c9d8b3e2573c14ded58687e65460caa22", "value of dist_cancer_two_rows[1, 0] is not correct"

assert sha1(str(type(dist_cancer_two_rows[1, 1])).encode("utf-8")+b"35651338e8717362").hexdigest() == "e3f354626a44e893ea9d8d9a3655aabba0c2b0b7", "type of dist_cancer_two_rows[1, 1] is not correct"
assert sha1(str(dist_cancer_two_rows[1, 1]).encode("utf-8")+b"35651338e8717362").hexdigest() == "c9dc2f4bd88ea94149b198bca2837f190fa8cc8f", "value of dist_cancer_two_rows[1, 1] is not correct"

print('Success!')

**Question 1.09.4** True or False: 
<br> {points: 1}

Compare `answer1_7`, `root_dif_sum`, and `dist_cancer_two_rows`. 

**Are they all the same value?** 

*Assign your answer to an object called `answer1_09_4`. Make sure the correct answer is a boolean. i.e. `True` or `False`.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_09_4)).encode("utf-8")+b"c504f1e2fd2e5074").hexdigest() == "f3f6af00499f5744d77703bffcc58ed57c4a0ebf", "type of answer1_09_4 is not bool. answer1_09_4 should be a bool"
assert sha1(str(answer1_09_4).encode("utf-8")+b"c504f1e2fd2e5074").hexdigest() == "f835dab6742746098249e67ebf04fe96e82c8928", "boolean value of answer1_09_4 is not correct"

print('Success!')

## 2. Classification - A Simple Example Done Manually

**Question 2.0.0**
<br> {points: 1}

Let's take a random sample of 5 observations from the breast cancer dataset using the `sample` dataframe method. To make this random sample reproducible, we will pass `random_state=20` to `sample`. This means that the random number generator will start at the same point each time we run the code, so we will always get back the same random sample. 

We will focus on the predictors Symmetry and Radius only. Thus, we will need to select the columns `Symmetry`, `Radius`, and `Class`. Save these 5 rows and 3 columns to a data frame named `small_sample`.

Fill in the `___` in the scaffolding provided below.
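
The effect of fixing `random_state` can be checked on a small made-up data frame. In this sketch, `toy`, its columns, and the seed value `1` are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical toy data frame with 10 rows (not the breast cancer data)
toy = pd.DataFrame({"id": range(10), "value": [x * 1.5 for x in range(10)]})

# Sampling twice with the same random_state selects the same rows both times
first_draw = toy.sample(3, random_state=1)
second_draw = toy.sample(3, random_state=1)

print(first_draw.equals(second_draw))  # True
```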

In [None]:
# ___ = cancer.sample(5, random_state=20)[[___, ___, ___]]

# your code here
raise NotImplementedError
small_sample

In [None]:
from hashlib import sha1
assert sha1(str(type(small_sample is None)).encode("utf-8")+b"f8a0c939c599c213").hexdigest() == "c2f54f99b118f2723a1788b158891e333b1a7e0e", "type of small_sample is None is not bool. small_sample is None should be a bool"
assert sha1(str(small_sample is None).encode("utf-8")+b"f8a0c939c599c213").hexdigest() == "d80386be34bae3d6dede813efe18c1cb8c6272f7", "boolean value of small_sample is None is not correct"

assert sha1(str(type(small_sample)).encode("utf-8")+b"e0dfb2d53ffbda6b").hexdigest() == "4b61101d3e730c5b68960e7b7099bb0238dc9bd5", "type of type(small_sample) is not correct"

assert sha1(str(type(small_sample.shape)).encode("utf-8")+b"13098faa907155aa").hexdigest() == "75035a756f8f5d41398e8778b97388633d1bfaa6", "type of small_sample.shape is not tuple. small_sample.shape should be a tuple"
assert sha1(str(len(small_sample.shape)).encode("utf-8")+b"13098faa907155aa").hexdigest() == "7ee6199946381c56adae0fa2b3f6f81b98aa0cc2", "length of small_sample.shape is not correct"
assert sha1(str(sorted(map(str, small_sample.shape))).encode("utf-8")+b"13098faa907155aa").hexdigest() == "285bba0a09e1b3c0a9e812371ab179c3be36ba6c", "values of small_sample.shape are not correct"
assert sha1(str(small_sample.shape).encode("utf-8")+b"13098faa907155aa").hexdigest() == "4551f89efe9dddbf93ff6fdd3ec35c84abae4aee", "order of elements of small_sample.shape is not correct"

assert sha1(str(type("Symmetry" in small_sample.columns)).encode("utf-8")+b"067008e0e1b9a645").hexdigest() == "1dbfe8ffb7672011936b03729675eeb163572236", "type of \"Symmetry\" in small_sample.columns is not bool. \"Symmetry\" in small_sample.columns should be a bool"
assert sha1(str("Symmetry" in small_sample.columns).encode("utf-8")+b"067008e0e1b9a645").hexdigest() == "f2b450f70a8a15d641a60fce8622b86ed3b53e45", "boolean value of \"Symmetry\" in small_sample.columns is not correct"

assert sha1(str(type("Radius" in small_sample.columns)).encode("utf-8")+b"3e22a2c2890e75fb").hexdigest() == "401e2d04ae8a0709d3eef22e0dfbc2ab916ae57d", "type of \"Radius\" in small_sample.columns is not bool. \"Radius\" in small_sample.columns should be a bool"
assert sha1(str("Radius" in small_sample.columns).encode("utf-8")+b"3e22a2c2890e75fb").hexdigest() == "83dd8009ea4955a26cf410117888fe9dd9185ac5", "boolean value of \"Radius\" in small_sample.columns is not correct"

assert sha1(str(type("Class" in small_sample.columns)).encode("utf-8")+b"84e7776d3e64ffd3").hexdigest() == "bf155788fde591029d863a96c3640d814bab4d5a", "type of \"Class\" in small_sample.columns is not bool. \"Class\" in small_sample.columns should be a bool"
assert sha1(str("Class" in small_sample.columns).encode("utf-8")+b"84e7776d3e64ffd3").hexdigest() == "a7fab051a8d52005091db0add80cd6fa4d327917", "boolean value of \"Class\" in small_sample.columns is not correct"

print('Success!')

**Question 2.0.1**
<br> {points: 1}

Finally, create a scatter plot where `Symmetry` is on the x-axis, and `Radius` is on the y-axis. Color the points by `Class`. Name your plot `small_sample_plot`.

Fill in the `___` in the scaffolding provided below.

As you create this plot, ensure you follow the guidelines for creating effective visualizations. In particular, note on the plot axes whether the data is standardized or not. Don't set an opacity value in this chart.

In [None]:
# ___ = alt.Chart(___).mark_point().encode(
#     x=___,
#     y=___,
#     color=___
# )

# your code here
raise NotImplementedError
small_sample_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(small_sample_plot is None)).encode("utf-8")+b"78150ac68fcd8667").hexdigest() == "35ebff4db2972548ab42e6310df13fc58fb0caaf", "type of small_sample_plot is None is not bool. small_sample_plot is None should be a bool"
assert sha1(str(small_sample_plot is None).encode("utf-8")+b"78150ac68fcd8667").hexdigest() == "4b2cb8d4d51797e41b3b86bc5ffebbdccc3fc534", "boolean value of small_sample_plot is None is not correct"

assert sha1(str(type(sum(small_sample_plot.data.Symmetry))).encode("utf-8")+b"660c8d88d825df29").hexdigest() == "60ac348902f5046b527a1365c06940cbb8922284", "type of sum(small_sample_plot.data.Symmetry) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(small_sample_plot.data.Symmetry), 2)).encode("utf-8")+b"660c8d88d825df29").hexdigest() == "b5ebee3e76e8a7e947558283c639897995b50e32", "value of sum(small_sample_plot.data.Symmetry) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(small_sample_plot.mark)).encode("utf-8")+b"d1fa06f135c2f859").hexdigest() == "d1d567b12ca7f77b00bae67b718d77163be6acb9", "type of small_sample_plot.mark is not str. small_sample_plot.mark should be an str"
assert sha1(str(len(small_sample_plot.mark)).encode("utf-8")+b"d1fa06f135c2f859").hexdigest() == "feb199f58baf5d1b511751811c3d717d7a1bacb4", "length of small_sample_plot.mark is not correct"
assert sha1(str(small_sample_plot.mark.lower()).encode("utf-8")+b"d1fa06f135c2f859").hexdigest() == "29d409db8567e1c77f5510e95558014f9fb94285", "value of small_sample_plot.mark is not correct"
assert sha1(str(small_sample_plot.mark).encode("utf-8")+b"d1fa06f135c2f859").hexdigest() == "29d409db8567e1c77f5510e95558014f9fb94285", "correct string value of small_sample_plot.mark but incorrect case of letters"

assert sha1(str(type(small_sample_plot.encoding.color['shorthand'])).encode("utf-8")+b"b01a46f746b8cdbf").hexdigest() == "79a740bacf0da66828426c60882ed31acc0e7b1f", "type of small_sample_plot.encoding.color['shorthand'] is not str. small_sample_plot.encoding.color['shorthand'] should be an str"
assert sha1(str(len(small_sample_plot.encoding.color['shorthand'])).encode("utf-8")+b"b01a46f746b8cdbf").hexdigest() == "267d534f42308f8d0900d0718b02dc0d5039c065", "length of small_sample_plot.encoding.color['shorthand'] is not correct"
assert sha1(str(small_sample_plot.encoding.color['shorthand'].lower()).encode("utf-8")+b"b01a46f746b8cdbf").hexdigest() == "93c0be0eeab0bb3114a3f5d1fee1f40ba7b1af7b", "value of small_sample_plot.encoding.color['shorthand'] is not correct"
assert sha1(str(small_sample_plot.encoding.color['shorthand']).encode("utf-8")+b"b01a46f746b8cdbf").hexdigest() == "8c7e3eb118fe4b5d14b933b471ea64d4b1954e30", "correct string value of small_sample_plot.encoding.color['shorthand'] but incorrect case of letters"

assert sha1(str(type(small_sample_plot.encoding.x['shorthand'])).encode("utf-8")+b"5ffee55317901fb5").hexdigest() == "0c3147fd6f5cc58d0a8df4c46837d5828ab213fc", "type of small_sample_plot.encoding.x['shorthand'] is not str. small_sample_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(small_sample_plot.encoding.x['shorthand'])).encode("utf-8")+b"5ffee55317901fb5").hexdigest() == "1ae8d080b44c84a2c5b1a764563e4aa88a5c5561", "length of small_sample_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(small_sample_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"5ffee55317901fb5").hexdigest() == "205844d9f49897ca4516d034be9b945f62742613", "value of small_sample_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(small_sample_plot.encoding.x['shorthand']).encode("utf-8")+b"5ffee55317901fb5").hexdigest() == "f223a3cb7928bb6595585f49746031726e893336", "correct string value of small_sample_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(small_sample_plot.encoding.y['shorthand'])).encode("utf-8")+b"07eded4a2d1e5530").hexdigest() == "9e0472a9f384af80df0251bea47220c1a8f93006", "type of small_sample_plot.encoding.y['shorthand'] is not str. small_sample_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(small_sample_plot.encoding.y['shorthand'])).encode("utf-8")+b"07eded4a2d1e5530").hexdigest() == "6a2a2ede8534a14be265231a9d015bf62d8dcbce", "length of small_sample_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(small_sample_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"07eded4a2d1e5530").hexdigest() == "fd2cedd3b132e2eb820c193e5aa2e3f3359968d5", "value of small_sample_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(small_sample_plot.encoding.y['shorthand']).encode("utf-8")+b"07eded4a2d1e5530").hexdigest() == "e85de9a68545bb16b8b7efee601bc4ccca6c6a56", "correct string value of small_sample_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(isinstance(small_sample_plot.encoding.x['title'], str))).encode("utf-8")+b"a27f59cf45d3d99c").hexdigest() == "c18dfd7b0b908f9c4a7d608cc0fef52587398bce", "type of isinstance(small_sample_plot.encoding.x['title'], str) is not bool. isinstance(small_sample_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(small_sample_plot.encoding.x['title'], str)).encode("utf-8")+b"a27f59cf45d3d99c").hexdigest() == "64a1c96c70fc8dc9d0edf1a677082ad84ee9c4b0", "boolean value of isinstance(small_sample_plot.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(small_sample_plot.encoding.y['title'], str))).encode("utf-8")+b"c72622c7e7894118").hexdigest() == "4dd96231b2c674b49cbf45d2d581cb1c78af4562", "type of isinstance(small_sample_plot.encoding.y['title'], str) is not bool. isinstance(small_sample_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(small_sample_plot.encoding.y['title'], str)).encode("utf-8")+b"c72622c7e7894118").hexdigest() == "464a81eb3be94f71331ce0a7eb6fe8fac0dd47b7", "boolean value of isinstance(small_sample_plot.encoding.y['title'], str) is not correct"

assert sha1(str(type(isinstance(small_sample_plot.encoding.color['title'], str))).encode("utf-8")+b"562c978cec8b204b").hexdigest() == "06cc44d0c4c517d42022062545b96cc1f9140d4c", "type of isinstance(small_sample_plot.encoding.color['title'], str) is not bool. isinstance(small_sample_plot.encoding.color['title'], str) should be a bool"
assert sha1(str(isinstance(small_sample_plot.encoding.color['title'], str)).encode("utf-8")+b"562c978cec8b204b").hexdigest() == "71af9ffd426cc0449d408c40d1311545f1b4e4f2", "boolean value of isinstance(small_sample_plot.encoding.color['title'], str) is not correct"

print('Success!')

**Question 2.1** 
<br> {points: 1}

Suppose we are interested in classifying a new observation with `Symmetry = 0.5` and `Radius = 0`, but unknown `Class`. Using the `small_sample` data frame, add another row with `Symmetry = 0.5`, `Radius = 0`, and `Class = "unknown"`.

Fill in the `___` in the scaffolding provided below.

*Assign your answer to an object called `updated_sample`.*
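
As a reminder of how `pd.concat` behaves when stacking rows, here is a minimal sketch on made-up data. The `fruit` and `new_row` data frames and their columns are hypothetical and unrelated to `small_sample`.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the worksheet's `small_sample`
fruit = pd.DataFrame({"weight": [120.0, 150.0], "label": ["apple", "orange"]})
new_row = pd.DataFrame({"weight": [135.0], "label": ["unknown"]})

# Stack the two data frames row-wise. Without reset_index(drop=True),
# the appended row would keep its own index label 0, so two rows would share label 0.
combined = pd.concat([fruit, new_row]).reset_index(drop=True)
print(combined)
```

Resetting the index matters here because later steps refer to rows by their position.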

In [None]:
# new_observation = pd.DataFrame({"_____": [_____], "_____": [_____], "_____": [_____]})
# ______ = pd.concat([_____, _____]).reset_index(drop=True)

# your code here
raise NotImplementedError
updated_sample

In [None]:
from hashlib import sha1
assert sha1(str(type(updated_sample is None)).encode("utf-8")+b"b87f9fffa76a078e").hexdigest() == "3b440d13f0632d48cd6fa0ec1070b527692756a0", "type of updated_sample is None is not bool. updated_sample is None should be a bool"
assert sha1(str(updated_sample is None).encode("utf-8")+b"b87f9fffa76a078e").hexdigest() == "2e16cfdaca5a7281f8133e51cdf8f92b215b973b", "boolean value of updated_sample is None is not correct"

assert sha1(str(type(updated_sample)).encode("utf-8")+b"14c7c93f5334aec0").hexdigest() == "1bbd0731017ed3dd662e955957099973c9db49b5", "type of type(updated_sample) is not correct"

assert sha1(str(type(updated_sample.Class[5])).encode("utf-8")+b"9f5e51c66d646e1f").hexdigest() == "22a173f8bde3ef610ea30b34ff55dc27b14a311d", "type of updated_sample.Class[5] is not str. updated_sample.Class[5] should be an str"
assert sha1(str(len(updated_sample.Class[5])).encode("utf-8")+b"9f5e51c66d646e1f").hexdigest() == "ba795dab8cbdca71be8ec0db2eeaafa240796465", "length of updated_sample.Class[5] is not correct"
assert sha1(str(updated_sample.Class[5].lower()).encode("utf-8")+b"9f5e51c66d646e1f").hexdigest() == "eaa8a80d1ee18c18885e981943aba4907534bd48", "value of updated_sample.Class[5] is not correct"
assert sha1(str(updated_sample.Class[5]).encode("utf-8")+b"9f5e51c66d646e1f").hexdigest() == "eaa8a80d1ee18c18885e981943aba4907534bd48", "correct string value of updated_sample.Class[5] but incorrect case of letters"

assert sha1(str(type(updated_sample.shape)).encode("utf-8")+b"9ef5dc780ea68165").hexdigest() == "848d04e9f453e71c27508e1d6c2bbd066a6f68d2", "type of updated_sample.shape is not tuple. updated_sample.shape should be a tuple"
assert sha1(str(len(updated_sample.shape)).encode("utf-8")+b"9ef5dc780ea68165").hexdigest() == "ad741f60d42a5fbd7326beab62d726ba7e1c5778", "length of updated_sample.shape is not correct"
assert sha1(str(sorted(map(str, updated_sample.shape))).encode("utf-8")+b"9ef5dc780ea68165").hexdigest() == "fd9791e35f5dac904291a1588197e140c5c02cc6", "values of updated_sample.shape are not correct"
assert sha1(str(updated_sample.shape).encode("utf-8")+b"9ef5dc780ea68165").hexdigest() == "6b6a17121164f6ffc58821fa32fc0e92b75681c4", "order of elements of updated_sample.shape is not correct"

assert sha1(str(type(sum(updated_sample.Radius))).encode("utf-8")+b"6e5ed160fdbcb30b").hexdigest() == "4c4ad4480b5332a130538b1a8619692a4054388d", "type of sum(updated_sample.Radius) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(updated_sample.Radius), 2)).encode("utf-8")+b"6e5ed160fdbcb30b").hexdigest() == "2094b7f4c2c87a3402640c49c1f91473cb424bc7", "value of sum(updated_sample.Radius) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(updated_sample.Symmetry))).encode("utf-8")+b"39f16cb4eede5076").hexdigest() == "71fa607cc3f52e2430caf50185a37809b67e5964", "type of sum(updated_sample.Symmetry) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(updated_sample.Symmetry), 2)).encode("utf-8")+b"39f16cb4eede5076").hexdigest() == "1ec9817830cac2804dad451e4fe9d167230ecec9", "value of sum(updated_sample.Symmetry) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 2.2**
<br> {points: 1}

Compute the distance between each pair of the 6 observations in the `updated_sample` dataframe using the `euclidean_distances` function based on two variables: `Symmetry` and `Radius`. Fill in the `___` in the scaffolding provided below.


*Assign your answer to an object called `dist_matrix`.*
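
For two points $a$ and $b$ with predictors $x$ and $y$, the straight-line (Euclidean) distance is $\sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}$. The sketch below checks this on three made-up points; the `points` data frame and its column names are assumptions chosen so the arithmetic is easy to verify by hand.

```python
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical toy points: (0, 0), (3, 4) and (0, 1)
points = pd.DataFrame({"x": [0.0, 3.0, 0.0], "y": [0.0, 4.0, 1.0]})

# Pairwise distances between every pair of rows, wrapped in a data frame
toy_dist = pd.DataFrame(euclidean_distances(points[["x", "y"]]))
print(toy_dist)

# By hand: distance between (0, 0) and (3, 4) is sqrt(3**2 + 4**2) = 5
by_hand = ((3.0 - 0.0) ** 2 + (4.0 - 0.0) ** 2) ** 0.5
```

Entry `[0, 1]` of `toy_dist` should agree with `by_hand` up to floating-point rounding.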

In [None]:
# dist_matrix = pd.DataFrame(___(updated_sample[[___, ___]]))

# your code here
raise NotImplementedError
dist_matrix

In [None]:
from hashlib import sha1
assert sha1(str(type(dist_matrix is None)).encode("utf-8")+b"09334fbb8b4119cf").hexdigest() == "e1a9748fea222d6cff35677851c848dd61c237db", "type of dist_matrix is None is not bool. dist_matrix is None should be a bool"
assert sha1(str(dist_matrix is None).encode("utf-8")+b"09334fbb8b4119cf").hexdigest() == "d6bb9f04436ce753f37eb4fdcd9fe2b47e72daa5", "boolean value of dist_matrix is None is not correct"

assert sha1(str(type(dist_matrix)).encode("utf-8")+b"31c81951d0934643").hexdigest() == "ace9c89e31883dbc601dd7f03ec43b0aea4fc593", "type of type(dist_matrix) is not correct"

assert sha1(str(type(dist_matrix.shape)).encode("utf-8")+b"7173b67f29c40c21").hexdigest() == "d183ad8941ca1b099b8ccde12a1bbd9cdb828525", "type of dist_matrix.shape is not tuple. dist_matrix.shape should be a tuple"
assert sha1(str(len(dist_matrix.shape)).encode("utf-8")+b"7173b67f29c40c21").hexdigest() == "a05cc9cd373c43847e41843aa21c36d2f8ae68c6", "length of dist_matrix.shape is not correct"
assert sha1(str(sorted(map(str, dist_matrix.shape))).encode("utf-8")+b"7173b67f29c40c21").hexdigest() == "c48854e2027660707dc5029dca38f67552d93392", "values of dist_matrix.shape are not correct"
assert sha1(str(dist_matrix.shape).encode("utf-8")+b"7173b67f29c40c21").hexdigest() == "953b25b6d9dd95327e29ae1c1ff90d08c2e52c7d", "order of elements of dist_matrix.shape is not correct"

assert sha1(str(type(sum(dist_matrix.iloc[0]))).encode("utf-8")+b"28ec6ae118431279").hexdigest() == "51f88702fb604a70bba9a991a996da38ab5443b2", "type of sum(dist_matrix.iloc[0]) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(dist_matrix.iloc[0]), 2)).encode("utf-8")+b"28ec6ae118431279").hexdigest() == "3802d973d0ff7de969d20ca4812c98531bae7c8e", "value of sum(dist_matrix.iloc[0]) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(dist_matrix.iloc[1]))).encode("utf-8")+b"e937ee92e10d9f05").hexdigest() == "01d553ea33678cce11a53627bbdd4068d78fd2f8", "type of sum(dist_matrix.iloc[1]) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(dist_matrix.iloc[1]), 2)).encode("utf-8")+b"e937ee92e10d9f05").hexdigest() == "3c3141fc0711e5f47bf2660cbc0b84c620e7e42c", "value of sum(dist_matrix.iloc[1]) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(dist_matrix.iloc[4]))).encode("utf-8")+b"185cd7e990bd0f45").hexdigest() == "520393bd775c1d5c0a0c7db6c5906c982e2fab5a", "type of sum(dist_matrix.iloc[4]) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(dist_matrix.iloc[4]), 2)).encode("utf-8")+b"185cd7e990bd0f45").hexdigest() == "4b23a7bfa116fe447d6f27dec29e2481a08001c1", "value of sum(dist_matrix.iloc[4]) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(dist_matrix.iloc[5]))).encode("utf-8")+b"7c66c4c43edc6793").hexdigest() == "df4770327ce1ae53acadebb1d99fb3ef6bbfde6c", "type of sum(dist_matrix.iloc[5]) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(dist_matrix.iloc[5]), 2)).encode("utf-8")+b"7c66c4c43edc6793").hexdigest() == "cd176dd33fbeac0b414794082f5c422d380f6c2b", "value of sum(dist_matrix.iloc[5]) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 2.3** Multiple Choice:
<br> {points: 1}

In the table above, the row and column labels correspond to the row numbers of the data frame that the `euclidean_distances` function was applied to. Each value in `dist_matrix` is the distance between the point in that row and the point in that column. For example, the distance between point 2 and point 4 is 2.444209, and the distance between point 3 and itself is 0.

The diagonal is all zeros because the distance between a point and itself is always zero. Each pairwise distance appears twice in the table, once above and once below the diagonal, so it would be enough to store the values on only one side of the diagonal. Here we keep all of them because the full square table is easier to read.
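
One possible way to read such a table programmatically is sketched below. The 4x4 matrix is hypothetical (hand-written numbers, not the worksheet's `dist_matrix`), and its last row stands in for a new point.

```python
import pandas as pd

# Hypothetical symmetric distance matrix; row 3 plays the role of the new point
toy = pd.DataFrame([
    [0.0, 2.0, 5.0, 3.0],
    [2.0, 0.0, 4.0, 1.5],
    [5.0, 4.0, 0.0, 2.5],
    [3.0, 1.5, 2.5, 0.0],
])

# Each distance appears once above and once below the zero diagonal
is_symmetric = bool((toy.values == toy.values.T).all())

# Distances from point 3 to every other point, excluding itself, closest first
from_new = toy.iloc[3].drop(3).sort_values()
nearest = int(from_new.index[0])
print(from_new)
```

Dropping the point's own entry first avoids picking the zero on the diagonal as the "nearest" neighbour.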

**Which observation is the nearest to our new point?**

*Assign your answer to an object called `answer2_3`. Make sure your answer is a number.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_3)).encode("utf-8")+b"ac79f11262afc8ec").hexdigest() == "a6f41123d087ff5ce9b5b3b6ea7c80021da0c6b7", "type of answer2_3 is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(answer2_3).encode("utf-8")+b"ac79f11262afc8ec").hexdigest() == "d2209bd659f0ac5301c3029ada55710cbe57dda7", "value of answer2_3 is not correct"

print('Success!')

**Question 2.4** Multiple Choice: 
<br> {points: 1}

If we use the K-nearest neighbour classification algorithm with K = 1 to classify the new observation using your answers to **Questions 2.2 & 2.3**, is the new data point predicted to be benign or malignant?

*Assign your answer to an object called `answer2_4`. Make sure the correct answer is written fully as either "Benign" or "Malignant". Remember to surround your answer with quotation marks.* 

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_4)).encode("utf-8")+b"52023d14a47a9ea4").hexdigest() == "ea1e2969a2634ffdefde49b16e260aa2b8ff739d", "type of answer2_4 is not str. answer2_4 should be an str"
assert sha1(str(len(answer2_4)).encode("utf-8")+b"52023d14a47a9ea4").hexdigest() == "d6c1ea6d7b5d167423b0bd1087b6285ad91ce518", "length of answer2_4 is not correct"
assert sha1(str(answer2_4.lower()).encode("utf-8")+b"52023d14a47a9ea4").hexdigest() == "800fbf361cd5a8d4f90a71f888e0f94f55b25a3c", "value of answer2_4 is not correct"
assert sha1(str(answer2_4).encode("utf-8")+b"52023d14a47a9ea4").hexdigest() == "7e616a50e279bac96ffaefba5a0b00eca89bd639", "correct string value of answer2_4 but incorrect case of letters"

print('Success!')

**Question 2.5** Multiple Choice:
<br> {points: 1}

Using your answers to **Questions 2.2 & 2.3**, what are the three closest observations to your new point?

A. 1, 3, 2

B. 0, 1, 3

C. 5, 2, 4

D. 3, 4, 2

*Assign your answer to an object called `answer2_5`. Make sure the correct answer is an uppercase letter. Remember to surround your answer with quotation marks (e.g. "F").*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_5)).encode("utf-8")+b"13edee2d960886ca").hexdigest() == "bd9afbe299fb2039ae153d5dc9d0cf4fd71e29ca", "type of answer2_5 is not str. answer2_5 should be an str"
assert sha1(str(len(answer2_5)).encode("utf-8")+b"13edee2d960886ca").hexdigest() == "7067fdc350c70e87d9bd3876e88c3c27f5f34543", "length of answer2_5 is not correct"
assert sha1(str(answer2_5.lower()).encode("utf-8")+b"13edee2d960886ca").hexdigest() == "5b69ef5a51e74350aa1a63da70d661bbe5de2931", "value of answer2_5 is not correct"
assert sha1(str(answer2_5).encode("utf-8")+b"13edee2d960886ca").hexdigest() == "fc088971811449ec2d13c79a82963bc107f54cce", "correct string value of answer2_5 but incorrect case of letters"

print('Success!')

**Question 2.6** Multiple Choice: 
<br> {points: 1}

We will now use the K-nearest neighbour classification algorithm with K = 3 to classify the new observation using your answers to **Questions 2.2 & 2.3**. Is the new data point predicted to be benign or malignant?

*Assign your answer to an object called `answer2_6`. Make sure the correct answer is written fully. Remember to surround your answer with quotation marks (e.g. "Benign" / "Malignant").*
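
With K greater than 1, the predicted class is the most common label among the K nearest neighbours. The sketch below illustrates that vote on made-up labels; `neighbour_labels` is a hypothetical list and says nothing about the answer to this question.

```python
from collections import Counter

# Hypothetical labels of the three nearest neighbours, ordered closest first
neighbour_labels = ["cat", "dog", "dog"]

# K = 1 uses only the single closest neighbour
k1_prediction = neighbour_labels[0]

# K = 3 takes a majority vote over all three labels
k3_prediction = Counter(neighbour_labels).most_common(1)[0][0]

print(k1_prediction, k3_prediction)
```

In this toy case the two values of K disagree, which shows that the choice of K can change the prediction.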

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_6)).encode("utf-8")+b"9355780c1b7ed976").hexdigest() == "1f2251a1bbaa86f89b2dfc54f01b3baf1caf5507", "type of answer2_6 is not str. answer2_6 should be an str"
assert sha1(str(len(answer2_6)).encode("utf-8")+b"9355780c1b7ed976").hexdigest() == "ff28f011dc040f105cd96380a3f3a777e68be412", "length of answer2_6 is not correct"
assert sha1(str(answer2_6.lower()).encode("utf-8")+b"9355780c1b7ed976").hexdigest() == "60b7ce7671576b608bda567b7839797bc53df4c3", "value of answer2_6 is not correct"
assert sha1(str(answer2_6).encode("utf-8")+b"9355780c1b7ed976").hexdigest() == "44701250f76783d2f48125b56f407e5d0c2b5909", "correct string value of answer2_6 but incorrect case of letters"

print('Success!')

**Question 2.7**
<br> {points: 1}

Compare your answers in 2.4 and 2.6. Are they the same?

*Assign your answer to an object called `answer2_7`. Make sure the correct answer is written in lower-case. Remember to surround your answer with quotation marks (e.g. "yes" / "no").* 

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_7)).encode("utf-8")+b"f5c97bd490a76695").hexdigest() == "d45603d07e08add222fcff2fade55f622fb88b0b", "type of answer2_7 is not str. answer2_7 should be an str"
assert sha1(str(len(answer2_7)).encode("utf-8")+b"f5c97bd490a76695").hexdigest() == "de24194f1ce13b69d158dee2315ae1e82919d50a", "length of answer2_7 is not correct"
assert sha1(str(answer2_7.lower()).encode("utf-8")+b"f5c97bd490a76695").hexdigest() == "e6f3c12893dcbc709595f809630c6ae1810dec1c", "value of answer2_7 is not correct"
assert sha1(str(answer2_7).encode("utf-8")+b"f5c97bd490a76695").hexdigest() == "e6f3c12893dcbc709595f809630c6ae1810dec1c", "correct string value of answer2_7 but incorrect case of letters"

print('Success!')

## 3. Using `scikit-learn` to perform k-nearest neighbours

Now that we understand how K-nearest neighbours (k-nn) classification works, let's get familiar with the `scikit-learn` Python package. Using `scikit-learn` keeps our code simple and readable, and writing less code in a tidier format leaves less room for errors.

We'll again focus on `Radius` and `Symmetry` as the two predictors. This time, we would like to predict the class of a new observation with `Symmetry = 1` and `Radius = 0`. This one is a bit tricky to do visually from the plot below, and so is a motivating example for us to compute the prediction using k-nn with the `scikit-learn` package. Let's use `K = 7`.
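
The general `scikit-learn` workflow (create a model object, fit it, then predict) can be sketched on a tiny made-up data set. Everything named `toy...` below, along with the `height`, `width`, and `kind` columns, is hypothetical and is not part of the cancer data; `n_neighbors=3` is chosen only because the toy data has six rows.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data with two predictors and one class column
toy = pd.DataFrame({
    "height": [1.0, 1.2, 0.9, 3.0, 3.2, 2.9],
    "width": [1.0, 0.8, 1.1, 3.0, 3.1, 2.8],
    "kind": ["small", "small", "small", "large", "large", "large"],
})

# 1. Create the model object, 2. fit it on predictors and target
toy_model = KNeighborsClassifier(n_neighbors=3)
toy_model.fit(toy[["height", "width"]], toy["kind"])

# 3. Predict for a new observation that has the same predictor columns
toy_new = pd.DataFrame([[1.1, 0.9]], columns=["height", "width"])
toy_prediction = toy_model.predict(toy_new)
print(toy_prediction)
```

Note that `predict` returns an array with one entry per row of the new data, even when there is only one row.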

In [None]:
# Run this to remind yourself what the data looks like
cancer_plot

**Question 3.1** 
<br> {points: 1}

Create a **model** for K-nearest neighbours classification by creating a `KNeighborsClassifier` object. Set `n_neighbors = 7`.

Name your model specification `knn_spec`.

In [None]:
# ___ = KNeighborsClassifier(n_neighbors=___)

# your code here
raise NotImplementedError
knn_spec

In [None]:
from hashlib import sha1
assert sha1(str(type(knn_spec is None)).encode("utf-8")+b"59b5ca95ca461ed7").hexdigest() == "8fcd86e8245321a29f425394fb42ec199638f756", "type of knn_spec is None is not bool. knn_spec is None should be a bool"
assert sha1(str(knn_spec is None).encode("utf-8")+b"59b5ca95ca461ed7").hexdigest() == "59be28fb621fa0665728329335fb64f9cee3b58c", "boolean value of knn_spec is None is not correct"

assert sha1(str(type(knn_spec.n_neighbors)).encode("utf-8")+b"2490056517d0398f").hexdigest() == "ce3485dff53b472b5eb15c6f800ccfb64d158319", "type of knn_spec.n_neighbors is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(knn_spec.n_neighbors).encode("utf-8")+b"2490056517d0398f").hexdigest() == "a629a2b496bc6cb4d67e20bde52f7b05856664af", "value of knn_spec.n_neighbors is not correct"

assert sha1(str(type(knn_spec.algorithm)).encode("utf-8")+b"ae4813992b08fa51").hexdigest() == "f1272247d9022ea5b2a4bef1cb7d7146bbefcb76", "type of knn_spec.algorithm is not str. knn_spec.algorithm should be an str"
assert sha1(str(len(knn_spec.algorithm)).encode("utf-8")+b"ae4813992b08fa51").hexdigest() == "0b85084c5b209b84ce58e350210f6f2dcba5596a", "length of knn_spec.algorithm is not correct"
assert sha1(str(knn_spec.algorithm.lower()).encode("utf-8")+b"ae4813992b08fa51").hexdigest() == "45639c5b8d2bc186846bc976b4848c5bf804360b", "value of knn_spec.algorithm is not correct"
assert sha1(str(knn_spec.algorithm).encode("utf-8")+b"ae4813992b08fa51").hexdigest() == "45639c5b8d2bc186846bc976b4848c5bf804360b", "correct string value of knn_spec.algorithm but incorrect case of letters"

print('Success!')

**Question 3.2**
<br> {points: 1}

To train the model on the breast cancer dataset, call the `.fit` method of `knn_spec`, passing it the predictors and the target from the `cancer` dataset. Specify `Class` as your target variable and the `Symmetry` and `Radius` variables as your predictors. Name your fitted model `knn_fit`.

In [None]:
# X = ___[["Symmetry", ___]]
# y = ___[___]
# ___ = ___.fit(___, ___)

# your code here
raise NotImplementedError
knn_fit

In [None]:
from hashlib import sha1
assert sha1(str(type(knn_fit is None)).encode("utf-8")+b"0d144c682ca9ee59").hexdigest() == "eb14afcf2f290fcfba0b4546dbb14a058b64a9af", "type of knn_fit is None is not bool. knn_fit is None should be a bool"
assert sha1(str(knn_fit is None).encode("utf-8")+b"0d144c682ca9ee59").hexdigest() == "206b31bb8495ad5aa145013df2c7ce87133c28dc", "boolean value of knn_fit is None is not correct"

assert sha1(str(type(type(knn_fit))).encode("utf-8")+b"e6e4f32ce7c8781b").hexdigest() == "10c8ad09e24c85c72b18f8a21077240b60c7e41b", "type of type(knn_fit) is not correct"
assert sha1(str(type(knn_fit)).encode("utf-8")+b"e6e4f32ce7c8781b").hexdigest() == "7200c6361c5ca1b9d19e28dd47b9bfc32dab1300", "value of type(knn_fit) is not correct"

assert sha1(str(type(knn_fit.classes_)).encode("utf-8")+b"15f37663b41bebb4").hexdigest() == "33c537c8b3791080d08378af69bcb157c1bce04a", "type of knn_fit.classes_ is not correct"
assert sha1(str(knn_fit.classes_).encode("utf-8")+b"15f37663b41bebb4").hexdigest() == "5efe58dd3c6fdeae14a37af0ff0e2deceb5e7d50", "value of knn_fit.classes_ is not correct"

assert sha1(str(type(knn_fit.effective_metric_)).encode("utf-8")+b"5a392e71762c14bd").hexdigest() == "e2b6d61489d8aeab98594fed13143caba47ff815", "type of knn_fit.effective_metric_ is not str. knn_fit.effective_metric_ should be an str"
assert sha1(str(len(knn_fit.effective_metric_)).encode("utf-8")+b"5a392e71762c14bd").hexdigest() == "fe1c2e5ec56b5cd08f338eb3748e649f287518b0", "length of knn_fit.effective_metric_ is not correct"
assert sha1(str(knn_fit.effective_metric_.lower()).encode("utf-8")+b"5a392e71762c14bd").hexdigest() == "51284b4b45f37b8bc64aa4229e93674989792b6c", "value of knn_fit.effective_metric_ is not correct"
assert sha1(str(knn_fit.effective_metric_).encode("utf-8")+b"5a392e71762c14bd").hexdigest() == "51284b4b45f37b8bc64aa4229e93674989792b6c", "correct string value of knn_fit.effective_metric_ but incorrect case of letters"

assert sha1(str(type(knn_fit.n_features_in_)).encode("utf-8")+b"f8c892bf034159ed").hexdigest() == "5f44359b80be7f488918fc333dbfbdb938982a67", "type of knn_fit.n_features_in_ is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(knn_fit.n_features_in_).encode("utf-8")+b"f8c892bf034159ed").hexdigest() == "784d85c02fa36d3002a07f331f3548b608295969", "value of knn_fit.n_features_in_ is not correct"

assert sha1(str(type(X.columns.values)).encode("utf-8")+b"1fc655a29096e480").hexdigest() == "c7dd41a77bb83e7f5679cdb2edcdb608d13075b0", "type of X.columns.values is not correct"
assert sha1(str(X.columns.values).encode("utf-8")+b"1fc655a29096e480").hexdigest() == "3782356f3cfc197d71f99ddb001615d327db0ec9", "value of X.columns.values is not correct"

assert sha1(str(type(y.name)).encode("utf-8")+b"0aecc345bf240ecd").hexdigest() == "acb0183863b27b3384cbf29daef3010dbe01b8ed", "type of y.name is not str. y.name should be an str"
assert sha1(str(len(y.name)).encode("utf-8")+b"0aecc345bf240ecd").hexdigest() == "17fd1d71da0b024459484422aca34a2a1083e041", "length of y.name is not correct"
assert sha1(str(y.name.lower()).encode("utf-8")+b"0aecc345bf240ecd").hexdigest() == "11615b32315f7dc738b4bd2a693d1411ce1f9b49", "value of y.name is not correct"
assert sha1(str(y.name).encode("utf-8")+b"0aecc345bf240ecd").hexdigest() == "754d30a23808f985bf4a6fa1e2827fa3150a88a4", "correct string value of y.name but incorrect case of letters"

assert sha1(str(type(sum(X.Symmetry))).encode("utf-8")+b"22f8306a324a08d1").hexdigest() == "3fcede9e0c48d8011d693e9245686c61b43a87a4", "type of sum(X.Symmetry) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(X.Symmetry), 2)).encode("utf-8")+b"22f8306a324a08d1").hexdigest() == "a325eb25dbe4d341d231d757bca56c7cc9aa14f0", "value of sum(X.Symmetry) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(X.Radius))).encode("utf-8")+b"ef94f3175027af5a").hexdigest() == "5e9849b87d513ff5fec499eccb81fca30bbea543", "type of sum(X.Radius) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(X.Radius), 2)).encode("utf-8")+b"ef94f3175027af5a").hexdigest() == "19a29a51286e947d49364b8136a1969cc9845ded", "value of sum(X.Radius) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 3.3**
<br> {points: 1}

Now we will predict the `Class` of a new observation with a `Symmetry` of 1 and a `Radius` of 0. First, create a dataframe with these variables and values and call it `new_obs`. Next, call the `.predict` method of `knn_fit` on `new_obs` to obtain the prediction. Name your predicted class `class_prediction`.
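
To see which training rows a prediction is based on, a fitted `KNeighborsClassifier` also offers a `kneighbors` method that returns the distances to, and positions of, the nearest neighbours. The sketch below uses hypothetical toy data (`toy_X`, `toy_y`, `toy_point`), not the cancer data.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: three points close together and one far away
toy_X = pd.DataFrame({"a": [0.0, 1.0, 2.0, 10.0], "b": [0.0, 0.0, 0.0, 0.0]})
toy_y = pd.Series(["near", "near", "near", "far"], name="group")
toy_fit = KNeighborsClassifier(n_neighbors=3).fit(toy_X, toy_y)

# New observation with the same predictor columns as the training data
toy_point = pd.DataFrame([[0.9, 0.0]], columns=["a", "b"])

# Distances to, and row positions of, the 3 nearest training points (closest first)
distances, indices = toy_fit.kneighbors(toy_point)
neighbour_classes = toy_y.iloc[indices[0]].tolist()
toy_class = toy_fit.predict(toy_point)

print(indices[0], neighbour_classes, toy_class)
```

Looking up the neighbours' classes this way mirrors the by-hand procedure from Section 2.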

In [None]:
# ___ = pd.DataFrame([[1, 0]], columns=[___, ___])
# ___ = ___.predict(___)

# your code here
raise NotImplementedError
class_prediction

In [None]:
from hashlib import sha1
assert sha1(str(type(new_obs is None)).encode("utf-8")+b"6a0466f4c150ace4").hexdigest() == "cd8d4e2a9143bf140c068cf60e25b491159a330d", "type of new_obs is None is not bool. new_obs is None should be a bool"
assert sha1(str(new_obs is None).encode("utf-8")+b"6a0466f4c150ace4").hexdigest() == "060d9ee2f5285f4cf9c35ccdd6f002a79d9db21a", "boolean value of new_obs is None is not correct"

assert sha1(str(type(new_obs)).encode("utf-8")+b"698ac803d315bdb8").hexdigest() == "3e2993a894ccff7781253aa360e43a5629aa8722", "type of type(new_obs) is not correct"

assert sha1(str(type(new_obs.Symmetry.values)).encode("utf-8")+b"f0cee03744477c3b").hexdigest() == "cfc966dc6a0362f9feebed72ac2817057572ab72", "type of new_obs.Symmetry.values is not correct"
assert sha1(str(new_obs.Symmetry.values).encode("utf-8")+b"f0cee03744477c3b").hexdigest() == "7be8ca0b0092accf2c765237f55b9e760e483374", "value of new_obs.Symmetry.values is not correct"

assert sha1(str(type(new_obs.Radius.values)).encode("utf-8")+b"ff304a0533f0fd3f").hexdigest() == "e71883d4c161c423a62c352cb3bf1812aacfc70e", "type of new_obs.Radius.values is not correct"
assert sha1(str(new_obs.Radius.values).encode("utf-8")+b"ff304a0533f0fd3f").hexdigest() == "7ed1e31f686d41f91122065d94b74ad254ce59d6", "value of new_obs.Radius.values is not correct"

assert sha1(str(type(class_prediction is None)).encode("utf-8")+b"2eca4145d1c57f7f").hexdigest() == "b92bbf895b3945cd17b8dbc9686f8a6522b9d031", "type of class_prediction is None is not bool. class_prediction is None should be a bool"
assert sha1(str(class_prediction is None).encode("utf-8")+b"2eca4145d1c57f7f").hexdigest() == "4bd434233c1b2c2e425cf8ff5e7b3d394de2bb52", "boolean value of class_prediction is None is not correct"

assert sha1(str(type(class_prediction)).encode("utf-8")+b"a24cb8f3c3640627").hexdigest() == "4eafdee43e345add5604a1592a5c6cd66641d3e4", "type of class_prediction is not correct"
assert sha1(str(class_prediction).encode("utf-8")+b"a24cb8f3c3640627").hexdigest() == "d588f240d98c9b5019413121f2742543e1279f11", "value of class_prediction is not correct"

print('Success!')

**Question 3.4**
<br> {points: 1}

Let's perform K-nearest neighbour classification again, but with three predictors. Use the `scikit-learn` package and `K = 7` to classify a new observation where we measure `Symmetry = 1`, `Radius = 0` and `Concavity = 1`. Use the scaffolding from **Questions 3.2** and **3.3** to help you.

- Pass the same `knn_spec` from before to `fit`, but this time specify `Symmetry`, `Radius`, and `Concavity` as the predictors. Save the predictors as `X_2` and the target as `y_2`. Store the output in `knn_fit_2`. 
- Store the new observation values in an object called `new_obs_2`.
- Store the output of `predict` in an object called `class_prediction_2`.
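
As a rough illustration of the same fit-then-predict pattern with three predictors, here is a minimal sketch on made-up data. The data frame `toy`, its columns (`height`, `width`, `depth`, `label`), and `K = 3` are hypothetical and are not part of the worksheet's data set.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data with two well-separated groups (not the worksheet's data)
toy = pd.DataFrame({
    "height": [0.0, 0.1, 0.2, 0.1, 1.0, 1.1, 0.9, 1.0],
    "width":  [0.0, 0.2, 0.1, 0.1, 1.0, 0.9, 1.1, 1.0],
    "depth":  [0.1, 0.0, 0.2, 0.1, 1.0, 1.0, 0.9, 1.1],
    "label":  ["a", "a", "a", "a", "b", "b", "b", "b"],
})

# Model specification: K = 3 is an arbitrary choice for this sketch
toy_spec = KNeighborsClassifier(n_neighbors=3)

# Three predictor columns and one target column
toy_X = toy[["height", "width", "depth"]]
toy_y = toy["label"]
toy_fit = toy_spec.fit(toy_X, toy_y)

# The new observation must have the same predictor columns as toy_X
toy_new_obs = pd.DataFrame({"height": [0.95], "width": [1.0], "depth": [1.05]})
toy_prediction = toy_fit.predict(toy_new_obs)
print(toy_prediction)
```

Adding a predictor only changes which columns go into the predictor data frame and the new observation. The calls to `fit` and `predict` stay the same.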

In [None]:
# your code here
raise NotImplementedError
class_prediction_2

In [None]:
from hashlib import sha1
assert sha1(str(type(knn_fit_2 is None)).encode("utf-8")+b"188e847439d7cc67").hexdigest() == "6620c81d13e9f4159b3ee9f604b291a500018efb", "type of knn_fit_2 is None is not bool. knn_fit_2 is None should be a bool"
assert sha1(str(knn_fit_2 is None).encode("utf-8")+b"188e847439d7cc67").hexdigest() == "f707cc299eaee0b82cf3e9929350700702bd4537", "boolean value of knn_fit_2 is None is not correct"

assert sha1(str(type(knn_fit_2.kneighbors)).encode("utf-8")+b"972b5aa5ace0521e").hexdigest() == "831e84a462261dd8acbe59e9ab4cc04a93fc79a6", "type of knn_fit_2.kneighbors is not correct"
assert sha1(str(knn_fit_2.kneighbors).encode("utf-8")+b"972b5aa5ace0521e").hexdigest() == "044f4bc38d6e2fd94c593768c22fdf8d614cd5a1", "value of knn_fit_2.kneighbors is not correct"

assert sha1(str(type(knn_fit_2.effective_metric_)).encode("utf-8")+b"beb0ae69a9a4622b").hexdigest() == "9e291c5efa4223ef2df8ef26764843ae679f4a29", "type of knn_fit_2.effective_metric_ is not str. knn_fit_2.effective_metric_ should be an str"
assert sha1(str(len(knn_fit_2.effective_metric_)).encode("utf-8")+b"beb0ae69a9a4622b").hexdigest() == "204967c27e6bcc289c398b1a4e4eaaa7f2627c9f", "length of knn_fit_2.effective_metric_ is not correct"
assert sha1(str(knn_fit_2.effective_metric_.lower()).encode("utf-8")+b"beb0ae69a9a4622b").hexdigest() == "5698f499a9b07773db7a737da5fdb6a290aa3174", "value of knn_fit_2.effective_metric_ is not correct"
assert sha1(str(knn_fit_2.effective_metric_).encode("utf-8")+b"beb0ae69a9a4622b").hexdigest() == "5698f499a9b07773db7a737da5fdb6a290aa3174", "correct string value of knn_fit_2.effective_metric_ but incorrect case of letters"

assert sha1(str(type(type(knn_fit_2))).encode("utf-8")+b"4d34e2c5129dc231").hexdigest() == "095e494b4098fb8646cee988f4bbc1777a785a29", "type of type(knn_fit_2) is not correct"
assert sha1(str(type(knn_fit_2)).encode("utf-8")+b"4d34e2c5129dc231").hexdigest() == "264b891694a2db9a262332709d9fe03fe9ed9705", "value of type(knn_fit_2) is not correct"

assert sha1(str(type(knn_fit_2.n_features_in_)).encode("utf-8")+b"6ea6243cf55ada9b").hexdigest() == "2112741d6ee6d98b0fe3d842470020c2829c09b2", "type of knn_fit_2.n_features_in_ is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(knn_fit_2.n_features_in_).encode("utf-8")+b"6ea6243cf55ada9b").hexdigest() == "ff6202c9411c863da33fbd45b06eabb354e5bb05", "value of knn_fit_2.n_features_in_ is not correct"

assert sha1(str(type(X_2.columns.values)).encode("utf-8")+b"fb6b84dd1a9b40e0").hexdigest() == "3d4d87a751a37eb48f0fc4bbef02fc892c923c48", "type of X_2.columns.values is not correct"
assert sha1(str(X_2.columns.values).encode("utf-8")+b"fb6b84dd1a9b40e0").hexdigest() == "41adbc71447965911bf99782db4afa6a12384d9a", "value of X_2.columns.values is not correct"

assert sha1(str(type(y_2.name)).encode("utf-8")+b"b308ff77cf42f39c").hexdigest() == "190e76a5f1e0bf56d95466f903146c91d91b8fa6", "type of y_2.name is not str. y_2.name should be an str"
assert sha1(str(len(y_2.name)).encode("utf-8")+b"b308ff77cf42f39c").hexdigest() == "6ec82bc71a6098293f0f4d3437f8f52dbb59be5f", "length of y_2.name is not correct"
assert sha1(str(y_2.name.lower()).encode("utf-8")+b"b308ff77cf42f39c").hexdigest() == "f4e5d8b48a3db7226cdec4e5baf6872a428cac32", "value of y_2.name is not correct"
assert sha1(str(y_2.name).encode("utf-8")+b"b308ff77cf42f39c").hexdigest() == "1e9354603f5164471a6142f4dcbfb7bfc3f13e50", "correct string value of y_2.name but incorrect case of letters"

assert sha1(str(type(y_2.values)).encode("utf-8")+b"f63cb9766badf2bb").hexdigest() == "687813a0f0c53b0bd4e2f56a407ec02ddeea568b", "type of y_2.values is not correct"
assert sha1(str(y_2.values).encode("utf-8")+b"f63cb9766badf2bb").hexdigest() == "4bfc87a5095a62560f16e8d10caa5833a092b329", "value of y_2.values is not correct"

assert sha1(str(type(sum(X_2.Symmetry))).encode("utf-8")+b"a2b057d5374f4731").hexdigest() == "dfbbf0a9cb58cccff2d06ec8a71cce4b6ecf2e2d", "type of sum(X_2.Symmetry) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(X_2.Symmetry), 2)).encode("utf-8")+b"a2b057d5374f4731").hexdigest() == "7de598c1c1d61f80415b5a08e206ed71de0b2074", "value of sum(X_2.Symmetry) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(X_2.Radius))).encode("utf-8")+b"747bc7f25a7f64b2").hexdigest() == "e838dc528ed522f23c26423a4a5854133f65bca3", "type of sum(X_2.Radius) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(X_2.Radius), 2)).encode("utf-8")+b"747bc7f25a7f64b2").hexdigest() == "6bd3556bfaa4160bb73cfcaaa58cec1333d45735", "value of sum(X_2.Radius) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(X_2.Concavity))).encode("utf-8")+b"7ee9b0dbea089ece").hexdigest() == "931a2fb797e63e3a9cf9cad62739c41f260f4433", "type of sum(X_2.Concavity) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(X_2.Concavity), 2)).encode("utf-8")+b"7ee9b0dbea089ece").hexdigest() == "7b6120d7c77a74a152ab3b380f41479513342b6b", "value of sum(X_2.Concavity) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(new_obs_2 is None)).encode("utf-8")+b"108418493b726b87").hexdigest() == "9752647819333d62d1021817bbe14be63a64a870", "type of new_obs_2 is None is not bool. new_obs_2 is None should be a bool"
assert sha1(str(new_obs_2 is None).encode("utf-8")+b"108418493b726b87").hexdigest() == "59ffb82f93200dd8e34a72ade893a5c2ed9561ec", "boolean value of new_obs_2 is None is not correct"

assert sha1(str(type(new_obs_2)).encode("utf-8")+b"32470812fb5ba48b").hexdigest() == "a2587b7b2a024fbca5b722e9d36a28da5f7a6139", "type of type(new_obs_2) is not correct"

assert sha1(str(type(new_obs_2.Symmetry.values)).encode("utf-8")+b"bf3e375ee179df5e").hexdigest() == "7a26842481000ee5936cd33ce9b8fa9a696ab199", "type of new_obs_2.Symmetry.values is not correct"
assert sha1(str(new_obs_2.Symmetry.values).encode("utf-8")+b"bf3e375ee179df5e").hexdigest() == "9136694b15a055431c2d3de225cbdf29ca5e31d2", "value of new_obs_2.Symmetry.values is not correct"

assert sha1(str(type(new_obs_2.Radius.values)).encode("utf-8")+b"bcb41476a727b484").hexdigest() == "a49d6d21d0c8e374ad736ca3eed40c54f68e67a9", "type of new_obs_2.Radius.values is not correct"
assert sha1(str(new_obs_2.Radius.values).encode("utf-8")+b"bcb41476a727b484").hexdigest() == "a4b0a3bd4066b69f7ff9d435c795fd0271216a42", "value of new_obs_2.Radius.values is not correct"

assert sha1(str(type(new_obs_2.Concavity.values)).encode("utf-8")+b"edfa4d828982871a").hexdigest() == "33f2db6f19a4413d7852ef6b688075acc4300570", "type of new_obs_2.Concavity.values is not correct"
assert sha1(str(new_obs_2.Concavity.values).encode("utf-8")+b"edfa4d828982871a").hexdigest() == "1b6ac39d470b01c407f339c46f16961aaaa2e6f8", "value of new_obs_2.Concavity.values is not correct"

assert sha1(str(type(class_prediction_2 is None)).encode("utf-8")+b"d506eb77ec3ac2ee").hexdigest() == "35cd8da6ccc0ed33bfe01a263286e91adc715f7b", "type of class_prediction_2 is None is not bool. class_prediction_2 is None should be a bool"
assert sha1(str(class_prediction_2 is None).encode("utf-8")+b"d506eb77ec3ac2ee").hexdigest() == "673397f866f1404a27e7fac6d99b9fd99f64337b", "boolean value of class_prediction_2 is None is not correct"

assert sha1(str(type(class_prediction_2)).encode("utf-8")+b"639f90a14d5ce563").hexdigest() == "5c9d76fe3746e26f099c7d7df10f756ad5629748", "type of class_prediction_2 is not correct"
assert sha1(str(class_prediction_2).encode("utf-8")+b"639f90a14d5ce563").hexdigest() == "72d00de34c02decbef1d21a147d19d3620af3f34", "value of class_prediction_2 is not correct"

print('Success!')

**Question 3.5**
<br> {points: 1}

Finally, we will perform K-nearest neighbour classification again, using the `scikit-learn` package and `K = 7` to classify a new observation where we use **all the predictors** in our data set (we give you the values in the code below). 

But we first have to do one important thing: we need to remove the ID variable from the analysis (it's not a numerical measurement that we should use for classification). Thankfully, `scikit-learn` provides a nice way of combining data preprocessing and training into a single consistent pipeline.

We will first create a preprocessor that removes the `ID` variable using the `drop` preprocessing step. Since we aren't preprocessing any of the other columns, we will set the `remainder` parameter to `"passthrough"`. Do so below using the provided scaffolding. Name the preprocessor object `knn_preprocessor`.


In [None]:
# ___ = make_column_transformer(
#     ("drop", [___]),
#     remainder=___
# )

# your code here
raise NotImplementedError
knn_preprocessor

In [None]:
from hashlib import sha1
assert sha1(str(type(knn_preprocessor is None)).encode("utf-8")+b"d129ef9e5ccf4c5c").hexdigest() == "33984d6f3a9b94c56cd5ace96adbda1fd84a2b29", "type of knn_preprocessor is None is not bool. knn_preprocessor is None should be a bool"
assert sha1(str(knn_preprocessor is None).encode("utf-8")+b"d129ef9e5ccf4c5c").hexdigest() == "96d2c90624e477637692adef9c3e4ebc42a0747c", "boolean value of knn_preprocessor is None is not correct"

assert sha1(str(type(type(knn_preprocessor))).encode("utf-8")+b"cf447305939461eb").hexdigest() == "643bfd3d023177c83d69627fcaa5dec112f7cc62", "type of type(knn_preprocessor) is not correct"
assert sha1(str(type(knn_preprocessor)).encode("utf-8")+b"cf447305939461eb").hexdigest() == "3910413219ef0008361855d74945dbdbae93b55f", "value of type(knn_preprocessor) is not correct"

assert sha1(str(type(knn_preprocessor.get_feature_names_out)).encode("utf-8")+b"c853aba1b49bca96").hexdigest() == "d48a7e3e2fdcdffe8dfc5aae848dd992f2f6a920", "type of knn_preprocessor.get_feature_names_out is not correct"
assert sha1(str(knn_preprocessor.get_feature_names_out).encode("utf-8")+b"c853aba1b49bca96").hexdigest() == "7a39338e216f80f4e73d11f55e32b399ecd76550", "value of knn_preprocessor.get_feature_names_out is not correct"

print('Success!')

**Question 3.6**
<br> {points: 1}

Create a **pipeline** that includes the new preprocessor (`knn_preprocessor`) and the model specification (`knn_spec`) using the scaffolding below. Name the pipeline object `knn_pipeline`.
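
For reference, the sketch below shows how `make_pipeline` chains a preprocessor and a model specification and how it names the steps. The names `toy_preprocessor`, `toy_spec`, `toy_pipeline`, and `sample_id` are hypothetical, and `K = 3` is arbitrary.

```python
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical preprocessor and model specification for illustration only
toy_preprocessor = make_column_transformer(
    ("drop", ["sample_id"]),
    remainder="passthrough",
)
toy_spec = KNeighborsClassifier(n_neighbors=3)

# Steps run in the order given: preprocessing first, then the classifier
toy_pipeline = make_pipeline(toy_preprocessor, toy_spec)

# make_pipeline names each step after its lowercased class name
toy_step_names = list(toy_pipeline.named_steps)
print(toy_step_names)
```

Each step can then be looked up by name, for example `toy_pipeline.named_steps.kneighborsclassifier.n_neighbors`.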

In [None]:
# ___ = make_pipeline(___, ___)

# your code here
raise NotImplementedError
knn_pipeline

In [None]:
from hashlib import sha1
assert sha1(str(type(knn_pipeline is None)).encode("utf-8")+b"9b68513415814880").hexdigest() == "abdfd9b4539270df3e1cba6824af5da8ce681f51", "type of knn_pipeline is None is not bool. knn_pipeline is None should be a bool"
assert sha1(str(knn_pipeline is None).encode("utf-8")+b"9b68513415814880").hexdigest() == "46a4a29727dbc26b4d0672fa6a0412f19c984de4", "boolean value of knn_pipeline is None is not correct"

assert sha1(str(type(type(knn_pipeline))).encode("utf-8")+b"ce06bf60195b746b").hexdigest() == "8f852e6e55ed178d37bc9940abf6d91e1418f788", "type of type(knn_pipeline) is not correct"
assert sha1(str(type(knn_pipeline)).encode("utf-8")+b"ce06bf60195b746b").hexdigest() == "006b756e683212a6c9ac8b0ba1b804fffa7b70f1", "value of type(knn_pipeline) is not correct"

assert sha1(str(type(knn_pipeline.named_steps.kneighborsclassifier.n_neighbors)).encode("utf-8")+b"4b6c7aa52c5c4d09").hexdigest() == "59eb1cac12a2a26bd4476593b1fcd2ec8754613a", "type of knn_pipeline.named_steps.kneighborsclassifier.n_neighbors is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(knn_pipeline.named_steps.kneighborsclassifier.n_neighbors).encode("utf-8")+b"4b6c7aa52c5c4d09").hexdigest() == "65afafcafd88be2d1ca6ac2ed215f2c8df54d363", "value of knn_pipeline.named_steps.kneighborsclassifier.n_neighbors is not correct"

print('Success!')

**Question 3.7**
<br> {points: 1}

Now `fit` the pipeline and predict the class label for the new observation `new_obs_all`. Name the predictors `X_3` and the target `y_3`, the fitted pipeline `knn_fit_all`, and the class prediction `class_prediction_all`.
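
The end-to-end pattern can be sketched on made-up data as follows. All names here (`toy`, `sample_id`, `height`, `width`, `label`) are hypothetical, `K = 3` is arbitrary, and the identifier value `999` is just a placeholder, since that column is dropped before classification.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: an identifier, two measurements, and a class label
toy = pd.DataFrame({
    "sample_id": [1, 2, 3, 4, 5, 6],
    "height": [0.0, 0.1, 0.2, 1.0, 1.1, 0.9],
    "width": [0.0, 0.2, 0.1, 1.0, 0.9, 1.1],
    "label": ["a", "a", "a", "b", "b", "b"],
})

toy_pipeline = make_pipeline(
    make_column_transformer(("drop", ["sample_id"]), remainder="passthrough"),
    KNeighborsClassifier(n_neighbors=3),
)

# Predictors are every column except the label; the pipeline removes sample_id
toy_X = toy.drop(columns=["label"])
toy_y = toy["label"]
toy_fit = toy_pipeline.fit(toy_X, toy_y)

# The new observation needs the same columns as toy_X, identifier included
toy_new_obs = pd.DataFrame({"sample_id": [999], "height": [0.1], "width": [0.1]})
toy_prediction = toy_fit.predict(toy_new_obs)
print(toy_prediction)
```

Because the preprocessing lives inside the pipeline, the same column drop is applied automatically to both the training data and the new observation.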

In [None]:
new_obs_all = pd.DataFrame(
    [[None, 0, 0, 0, 0, 0.5, 0, 1, 0, 1, 0]],
    columns=[
        "ID",
        "Radius",
        "Texture",
        "Perimeter",
        "Area",
        "Smoothness",
        "Compactness",
        "Concavity",
        "Concave_points",
        "Symmetry",
        "Fractal_dimension",
    ],
)
# X_3 = cancer.drop(columns=[___])
# y_3 = cancer[___]
# ___ = knn_pipeline.fit(___, ___)
# ___ = knn_fit_all.___(___)

# your code here
raise NotImplementedError
class_prediction_all

In [None]:
from hashlib import sha1
assert sha1(str(type(class_prediction_all)).encode("utf-8")+b"c237ed494e185f75").hexdigest() == "7de8ee712e5a5432052bc277539822ff62a21052", "type of class_prediction_all is not correct"
assert sha1(str(class_prediction_all).encode("utf-8")+b"c237ed494e185f75").hexdigest() == "329c503b0bb3b6ccff670e06f144bd5f671a6fd1", "value of class_prediction_all is not correct"

print('Success!')

## 4. Reviewing Some Concepts

We will conclude with two multiple-choice questions to reinforce some key concepts in classification with K-nearest neighbours.

**Question 4.0**
<br> {points: 1}

In the K-nearest neighbours classification algorithm, we calculate the distance between the new observation (for which we are trying to predict the class/label/outcome) and each of the observations in the training data set so that we can:

A. Find the `K` nearest neighbours of the new observation

B. Assess how well our model fits the data

C. Find outliers

D. Assign the new observation to a cluster

*Assign your answer (e.g., "E") to an object called `answer4_0`. Make sure your answer is an uppercase letter and is surrounded with quotation marks.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer4_0)).encode("utf-8")+b"ce6775d36a269265").hexdigest() == "26fa7d615b1baa3c198d43644d18d061630a5437", "type of answer4_0 is not str. answer4_0 should be an str"
assert sha1(str(len(answer4_0)).encode("utf-8")+b"ce6775d36a269265").hexdigest() == "33ba0346defc95da049b724d57d52d87b9107c09", "length of answer4_0 is not correct"
assert sha1(str(answer4_0.lower()).encode("utf-8")+b"ce6775d36a269265").hexdigest() == "4a6de4d5352feed155e47dedc3137885495e2271", "value of answer4_0 is not correct"
assert sha1(str(answer4_0).encode("utf-8")+b"ce6775d36a269265").hexdigest() == "cb0bb6b08f9b5ba0c819f80d5a6e8d485d9d4d2a", "correct string value of answer4_0 but incorrect case of letters"

print('Success!')

**Question 4.1**
<br> {points: 1}

In the K-nearest neighbours classification algorithm, we choose the label/class for a new observation by:

A. Taking the mean (average value) label/class of the K nearest neighbours 

B. Taking the median (middle value) label/class of the K nearest neighbours 

C. Taking the mode (value that appears most often, *i.e.*, the majority vote) label/class of the K nearest neighbours 

*Assign your answer (e.g., "E") to an object called `answer4_1`. Make sure your answer is an uppercase letter and is surrounded with quotation marks.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer4_1)).encode("utf-8")+b"9d54192e6936d81d").hexdigest() == "81425d9d8f9cd61f89ccde83cc751c9cacfb498d", "type of answer4_1 is not str. answer4_1 should be an str"
assert sha1(str(len(answer4_1)).encode("utf-8")+b"9d54192e6936d81d").hexdigest() == "cb684a531f168d8ed86a2e60a6fe58e0cc35c04d", "length of answer4_1 is not correct"
assert sha1(str(answer4_1.lower()).encode("utf-8")+b"9d54192e6936d81d").hexdigest() == "b23ed1f675d094797aeeac0ceb18ff87d45ade89", "value of answer4_1 is not correct"
assert sha1(str(answer4_1).encode("utf-8")+b"9d54192e6936d81d").hexdigest() == "96e4a285225180ab278329026112f3cde3f15895", "correct string value of answer4_1 but incorrect case of letters"

print('Success!')