EX09 - Data wrangling


In Exercise 09, we’ll construct one last function in the data_utils file we’ve been building up in class. Then we will walk you through some basic analysis that puts all of these functions together.

0. Pull the skeleton code

You will find the starter files needed by “pulling” from the course workspace repository. To do so:

  1. Be sure you are in your course workspace. Open the file explorer and you should see your work for the course. If you do not, open your course workspace through File > Open Recent.
  2. Open the Source Control view by clicking the icon of three circles connected by lines in your sidebar, or by opening the Command Palette and searching for Source Control.
  3. Click the Ellipses in the Source Control pane and select “Pull” from the drop-down menu. This will begin the pulling process from the course repository. It should silently succeed.
  4. Return to the File Explorer pane and open the exercises directory. You should see it now contains a directory named ex09. If you expand that directory, you will see the starter files for this exercise: one .py file and one .ipynb file. You should also see some new data files in your data folder.

If the above did not work, try the following:

  1. Click the Ellipses in the Source Control pane and select “Pull, Push” from the drop-down menu. Then select “Pull from”. Then select “upstream” and the main option. This will begin the pulling process from the course repository. It should silently succeed.
  2. Return to the File Explorer pane and open the exercises directory. You should see it now contains another directory named ex09. If you expand that directory, you should see the starter files.

Part 1. Count

Before you begin, copy the other data_utils functions you defined in this week’s lectures into this data_utils.py file. These should include read_csv_rows, column_values, columnar, head, and select.

Create a function in your data_utils.py file called count. It has the following specifications:

  1. It has one parameter, of type list[str] - the list of values to count the frequencies of.
  2. It returns a dict[str, int] - a dictionary of the counts of the items in the input list.

Implementation strategy:

  1. Establish an empty dictionary to store your built-up result in
  2. Loop through each item in the input list
    1. Check to see if that item has already been established as a key in your dictionary. Try the following boolean conditional: if <item> in <dict>: – replacing <item> with the variable name of the current value and <dict> with the name of your result dictionary.
    2. If the item is found in the dict, that means there is already a key/value pair where the item is a key. Increase the value associated with that key by 1 (counting it!)
    3. If the item is not found in the dict, that means this is the first time you are encountering the value and should assign an initial count of 1 to that key in the result dictionary.
  3. Return the resulting dictionary.
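
The strategy above can be sketched as follows. This is a minimal illustration rather than an official solution; the names values, result, and item are arbitrary choices and are not prescribed by the exercise.

```python
def count(values: list[str]) -> dict[str, int]:
    """Return a dictionary mapping each distinct value to the number of times it appears."""
    # Step 1: start with an empty dictionary to build the result in.
    result: dict[str, int] = {}
    # Step 2: visit each item in the input list.
    for item in values:
        if item in result:
            # The key already exists, so count this occurrence.
            result[item] += 1
        else:
            # First time seeing this value: start its count at 1.
            result[item] = 1
    # Step 3: return the built-up dictionary.
    return result


print(count(["a", "b", "a"]))  # {'a': 2, 'b': 1}
```

Note that an empty input list produces an empty dictionary, since the loop body never runs.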

Part 2. Using these functions to perform a basic analysis.

When you pulled the starter code, you should have also noticed a data_wrangling.ipynb file show up in your exercise directory. This will walk you through using these functions to perform some basic data analysis on a real-world dataset.

In this exercise, you will move through a very common first set of steps when working with a new data set:

  1. Read the data
  2. Transform it to be in a “shape” that is easier to work with
  3. Preview and select just the parts of the dataset you are interested in
  4. Run (simple, in this notebook) analyses
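
To make these four steps concrete, here is a rough sketch of the whole flow on a tiny in-memory table. Everything in it is an assumption made for illustration: the function signatures may differ from the ones you wrote in lecture, read_rows is a hypothetical stand-in for read_csv_rows that reads from a string instead of a file, the column names and rows are made up (they are not taken from the real data set), and collections.Counter stands in for the counting step.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for a CSV file on disk (not real data).
SAMPLE_CSV = """day,outcome
Mon,citation
Tue,warning
Mon,citation
"""


def read_rows(source: io.StringIO) -> list[dict[str, str]]:
    """Step 1: read each CSV row into a dictionary keyed by column name."""
    return [dict(row) for row in csv.DictReader(source)]


def columnar(rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Step 2: reshape row-oriented data into a column-oriented table."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


def select(table: dict[str, list[str]], names: list[str]) -> dict[str, list[str]]:
    """Step 3a: keep only the columns of interest."""
    return {name: table[name] for name in names}


def head(table: dict[str, list[str]], n: int) -> dict[str, list[str]]:
    """Step 3b: preview the first n values of every column."""
    return {name: values[:n] for name, values in table.items()}


rows = read_rows(io.StringIO(SAMPLE_CSV))
table = columnar(rows)
subset = select(table, ["outcome"])
preview = head(subset, 2)
# Step 4: a simple analysis, counting how often each outcome appears.
counts = dict(Counter(subset["outcome"]))

print(preview)  # {'outcome': ['citation', 'warning']}
print(counts)   # {'citation': 2, 'warning': 1}
```

The column-oriented shape is what makes the last step short: once a column is a plain list of strings, it can be passed directly to a counting function.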

The sample data set provided alongside this exercise is police stop data from Durham, as compiled by the Stanford Open Policing Project. This is a very small (348 rows out of the many millions the paper authors compiled) subset of the data that can be found on their site.

For those interested, a paper published in the journal Nature Human Behaviour in May 2020 compiled a dataset of over 100 million traffic stops across the United States and performed several analyses of the policing decisions. This research was the start of the Stanford Open Policing Project, which aims to make this data more accessible for the general public’s use. You can find a PDF copy of the paper on the home page of one of the authors here: https://5harad.com/papers/100M-stops.pdf

Be sure to save your work in data_utils.py before reevaluating cells in data_wrangling.ipynb if you are making changes.

There will be 10 manually graded points reserved for checking that you ran the cells in this notebook and worked through it. It should be a nice review of the functions that you will be using in PJ01 to perform your own analysis.

3. Make a Backup Checkpoint “Commit”

As you make progress on this exercise, making backups is encouraged. You do not have to make a backup in order to submit your work, though you are encouraged to before each submission so that you can revert to a previous point in your project if you accidentally change something you did not intend to.

  1. Open the Source Control panel (Command Palette: “Show SCM” or click the icon with three circles and lines on the activity panel).
  2. Notice the files listed under Changes. These are files you’ve made modifications to since your last backup.
  3. Move your mouse’s cursor over the word Changes and notice the + symbol that appears. Click that plus symbol to add all changes to the next backup. You will now see the files listed under “Staged Changes”.
    • If you do not want to backup all changed files, you can select them individually. For this course you’re encouraged to back everything up.
  4. In the Message box, give a brief description of what you’ve changed and are backing up. This will help you find a specific backup (called a “commit”) if needed. In this case a message such as “Progress on Exercise 09” will suffice.
  5. Press the Check icon to make a Commit (a version) of your work.
  6. Finally, press the Ellipses icon (…), look for “Pull/Push” submenu, and select “Push to…”, and in the dropdown select your backup repository.

4. Submit to Gradescope for Grading

Log in to Gradescope and select the assignment named “EX09 - Data Wrangling”. You’ll see an area to upload a zip file. To produce a zip file for autograding, return to Visual Studio Code.

If you do not see a Terminal at the bottom of your screen, open the Command Palette and search for “View: Toggle Integrated Terminal”.

To produce a zip file for ex09, type the following command (all on a single line):

python -m tools.submission exercises/ex09

In the file explorer pane, look for the zip file named “21.mm.dd-hh.mm-exercises-ex09.zip”. The “mm”, “dd”, and so on are parts of a timestamp giving the current month, day, hour, and minute. If you right click on this file and select “Reveal in File Explorer” on Windows or “Reveal in Finder” on Mac, the zip file’s location on your computer will open. Upload this file to Gradescope to submit your work for this exercise.