Data Processing

Please note, this is an adaptation of the student-facing pages and was designed for use in a brief conference session. The complete student-facing version is available on bjc.edc.org: Unit 5 Lab 3 Page 3: Importing and Accessing Data, Unit 5 Lab 3 Page 4: Analyzing Data, Unit 5 Lab 3 Page 5: Visualizing Data.

On this page, you will answer questions about data that interests you.

Importing and Accessing Data

  1. Saving is optional for this workshop session.
    Click here to load this file. Then save it to your Snap! account.
    The project includes a number of selectors that students actually build themselves. Here are the first five:
    • headings of table () takes a table as input and reports just the row of headings from a table
    • data of table () takes a table as input and reports all the rows of data (but not the headings) from the table
    • record () of table () takes a table of data and a record (row) number as input and reports that record from the data
    • field () of record () takes a single record (row) of a table of data and a field number as input and reports that field (item) from the record
    • column () of table () takes a table of data and a column number as input and reports that column from the data

    Notice that these block names include the word "table" or "record" before the second input. These expected input data types can help you avoid bugs caused by using an input that does not match what the selector expects to receive.

    three frame animation of the report of cars dataset displayed as a table with columns and rows; in the first frame, the fourth row of the table is highlighted and labeled 'record (row)'; in the second frame, the third column of the table is highlighted and labeled 'column'; in the third frame, the cell in the fourth row and third column is highlighted and labeled 'field'

    A table is represented in Snap! as a list of lists. If you right-click (or control-click on a Mac) a table, you can switch to "list view" and see how the data (and column headings) are stored. See examples of table view and list view.

    Table View: report of cars dataset displayed as a table with columns and rows; the first row is the label of each column; the remaining rows each contain the data for a single record
    List View: report of cars dataset displayed as a list of lists; the first list contains the labels for each of the columns shown in the table view; the remaining lists each contain the data for a single record
  2. What does CSV stand for?
    CSV stands for "comma separated values." CSV files are tables of data stored with commas between each item in a row and line breaks between each row in the table.
  3. Visit the CORGIS Datasets Project and select a dataset you'd like to explore. Download the CSV file for the data you want to explore.
  4. Open Snap! and drag the downloaded file into the Snap! window. You should see a table full of data. Look over the data (including the column headings in the top row) to give yourself a sense of what kind of information is included. Then click "OK" to close the window. You can still see the data in the watcher on the Snap! stage.
  5. Talk with Your Partner: Determine one question you can answer by looking at a single column of your data set, and then build code to answer your question.

    You can see the column number by holding your mouse pointer over the letter at the top of the column in table view.
    image of the top of the table view for the cars dataset with the mouse pointer over the top of the second column; the columns are now labeled A, 2 (where the pointer is), C, D, etc.

    You may need to use map, keep, or combine to answer your question.

    Example questions to ask about a single column:

    • What's the average MPG that cars in this database get in the city? (You'd need an average block.)
    • What's the year of the oldest car in this dataset? (You'd need a minimum block.)
    • How many cars in this dataset have manual transmission?

    Notice that all of these examples only require data from one column. If you want to ask a question that requires looking at another column (for example, "What's the model of the car with the highest MPG?"), you can do the Take It Further Activity below.
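
For readers who think more easily in text code, here is a minimal Python sketch of the ideas in this section: a CSV file becomes a list of lists, and the five selectors pick pieces out of it. It is an illustration only, not the Snap! project's code. The sample data is made up, the function names are Python stand-ins for the block names, and the record, field, and column selectors are assumed to take the data rows (headings already removed) with numbering that starts at 1, as the Snap! blocks do.

```python
import csv
import io

# Made-up sample standing in for a downloaded CSV file (not real CORGIS data).
CSV_TEXT = """Make,Year,City MPG
Alpha,2009,18
Beta,2010,22
Alpha,2011,20
"""

def read_table(text):
    """Parse CSV text into a list of lists; cells made only of digits become ints."""
    rows = list(csv.reader(io.StringIO(text)))
    return [[int(cell) if cell.isdigit() else cell for cell in row] for row in rows]

def headings_of_table(table):
    """Report just the first row (the column headings)."""
    return table[0]

def data_of_table(table):
    """Report every row except the headings."""
    return table[1:]

def record_of_table(n, rows):
    """Report record number n (1-based) from a list of data rows."""
    return rows[n - 1]

def field_of_record(n, record):
    """Report field number n (1-based) from a single record."""
    return record[n - 1]

def column_of_table(n, rows):
    """Report column number n (1-based) from a list of data rows."""
    return [field_of_record(n, record) for record in rows]

cars = read_table(CSV_TEXT)
data = data_of_table(cars)

# Two single-column questions about the made-up data.
city_mpg = column_of_table(3, data)
average_city_mpg = sum(city_mpg) / len(city_mpg)
oldest_year = min(column_of_table(2, data))

print(headings_of_table(cars))  # ['Make', 'Year', 'City MPG']
print(average_city_mpg)         # 20.0
print(oldest_year)              # 2009
```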

Analyzing Data

DAT-2.D.4
You can ask questions about a specific subset of your data by filtering the data using keep. Filtering is a powerful technique for finding information and recognizing patterns in data. For example, filtering can help you answer questions like "What is the average city MPG for just the Subarus in this dataset?"
average of (column (9) of table (keep items (field (14) of record 'empty list input slot' = Subaru) from (data of table (cars)))) reporting 19.704...
Column 14 is the "Make" of the vehicle, so we keep all the records from cars for which the 14th field is "Subaru." Then, we take column 9 of those records (the "City MPG") and find their average.

Notice that there are many digits in the answer above. How many digits are given in the table for each car's MPG? An important rule in data science is not to claim more precision in a result than is warranted by the given data, so this answer should be rounded to 20.
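
The same filter, then select a column, then average, then round pattern can be sketched in Python. This is an illustration under stated assumptions, not the Snap! code: the records are made up (so the result differs from the 19.704... above), the table has only two fields, and `field` and `average` are stand-ins for the blocks of the same name.

```python
# Made-up records (Make, City MPG); not the real cars dataset.
records = [
    ["Subaru", 19],
    ["Volvo", 18],
    ["Subaru", 21],
    ["Subaru", 20],
    ["Honda", 24],
    ["Subaru", 19],
]

MAKE, CITY_MPG = 1, 2  # 1-based field numbers in this made-up table

def field(n, record):
    """Report field number n (1-based) of a record."""
    return record[n - 1]

def average(numbers):
    """Report the arithmetic mean of a non-empty list of numbers."""
    return sum(numbers) / len(numbers)

# "keep": only the records whose make is Subaru.
subarus = [r for r in records if field(MAKE, r) == "Subaru"]

# "column": the city MPG of each kept record.
subaru_mpg = [field(CITY_MPG, r) for r in subarus]

raw_average = average(subaru_mpg)
reported = round(raw_average)  # the inputs are whole numbers, so report a whole number

print(raw_average)  # 19.75
print(reported)     # 20
```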

  1. In addition to the headings, data, record, field, and column blocks described in problem 1 above, your project also includes the following four mathematical blocks that students import from their work in Unit 2 Lab 4 Page 2: Making a Mathematical Library. Test them each with a simple list like {1, 2, 3, 4} to make sure they behave as you expect.
    • maximum of list 'list input slot'
    • minimum of list 'list input slot'
    • sum of list 'list input slot'
    • average of list 'list input slot'
    You can look inside a block to see how it works: Right-click (or control-click on a Mac) the block and select "edit..." from the menu that appears.
  2. Talk with Your Partner: Determine one question you can answer by looking at a single column of a portion of your data set, and then build code to answer that question.
  3. Example questions to ask about a portion of a single column:
    • What's the average MPG that Volvos in this database get in the city? (You'd need average.)
    • What's the year of the oldest Honda in this dataset? (You'd need minimum.)
    • How many 2009 cars in this dataset have manual transmission?

    Notice that the column you use to filter the data (such as year) doesn't have to be the column you are asking about (such as transmission).
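
As a rough Python sketch of that last point, the filter below tests one field while the answer comes from a different field. The records are made up, and the positions are ordinary 0-based Python list positions rather than Snap! field numbers.

```python
# Made-up records (Make, Year, Transmission); not the real cars dataset.
records = [
    ["Honda", 2011, "Manual"],
    ["Volvo", 2009, "Automatic"],
    ["Honda", 2009, "Manual"],
    ["Honda", 2010, "Automatic"],
    ["Volvo", 2009, "Manual"],
]

MAKE, YEAR, TRANSMISSION = 0, 1, 2  # 0-based Python list positions

# Filter on make, then ask about year.
hondas = [r for r in records if r[MAKE] == "Honda"]
oldest_honda_year = min(r[YEAR] for r in hondas)

# Filter on year, then filter on transmission, then count.
from_2009 = [r for r in records if r[YEAR] == 2009]
manual_2009 = [r for r in from_2009 if r[TRANSMISSION] == "Manual"]

print(oldest_honda_year)  # 2009
print(len(manual_2009))   # 2
```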

Sometimes you want to keep a subset of your data (such as "Which cars were made in 2010?"). Other times you want just one item that matches your requirement, often because what you really want to know is whether any items match: as soon as you find one, the answer is "yes" (such as "Were any electric cars made in 2010?"). Snap! has a higher order function find first item () in 'list input slot' that works similarly to keep but reports only the first item found, so it can be faster.

Find first is a higher order function like keep, map, and combine, because it takes a function (a predicate) as input. (Find first is like item (1) of (keep).)
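
The difference can be sketched in Python by counting how many times the predicate runs. The records are made up, `keep` and `find_first` are stand-ins for the Snap! blocks, and reporting `None` when nothing matches is an assumption of this sketch rather than a statement about what Snap! reports.

```python
# Made-up records (Year, Fuel); not the real cars dataset.
records = [
    [2009, "Gasoline"],
    [2010, "Electricity"],
    [2010, "Gasoline"],
    [2010, "Electricity"],
]

def keep(predicate, items):
    """Report every item for which the predicate is true."""
    return [item for item in items if predicate(item)]

def find_first(predicate, items):
    """Report the first matching item and stop looking; None if nothing matches."""
    for item in items:
        if predicate(item):
            return item
    return None

calls = []  # one entry is added each time the predicate runs

def is_2010_electric(record):
    calls.append(record)
    return record[0] == 2010 and record[1] == "Electricity"

first = find_first(is_2010_electric, records)
calls_for_find_first = len(calls)

calls.clear()
kept = keep(is_2010_electric, records)
calls_for_keep = len(calls)

print(first is not None)     # True: at least one 2010 electric car exists
print(calls_for_find_first)  # 2: stopped at the first match
print(calls_for_keep)        # 4: checked every record
print(first == kept[0])      # True: find first is like item 1 of keep
```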


DAT-2.D.6

You can access or change data to create new information by using:

    DAT-2.D
    Save your work
  1. Talk with Your Partner: Ask and answer a question that you can answer by looking at a single column of a portion of your data set and examining just the first matching item. Build code to answer your question.

Grouping Data

DAT-2.E.3 classifying only

Classifying data is extracting groups of data with a common characteristic.

Another thing that's often done in data science is grouping (or classifying) data. For example, here is the cars data grouped by vehicle make (column 14):
group (data of table (cars)) by field (14) by intervals of ( ) reporting a table with three columns; column A contains each make of car from the original data set; column B shows the total number of cars in the original data set that are of that make; column C shows a picture of a list in each row of the table.

The by intervals of input to group should be left empty when, as in this example, the field on which you're grouping is text rather than numbers. Later on this page you'll see how to use intervals in graphing.
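
A minimal Python sketch of grouping on a text field follows. It produces the same three-column shape described above (value, count, list of matching records). The records are made up, `group_by_field` is a hypothetical stand-in for the group block, and the groups are left in order of first appearance, which is an assumption of this sketch.

```python
# Made-up records (Make, Year); not the real cars dataset.
records = [
    ["Toyota", 2010],
    ["Hyundai", 2010],
    ["Toyota", 2011],
    ["Hyundai", 2009],
    ["Toyota", 2010],
]

def group_by_field(rows, n):
    """Group rows on field n (1-based); report [value, count, rows] per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[n - 1], []).append(row)
    return [[value, len(members), members] for value, members in groups.items()]

grouped = group_by_field(records, 1)

# How many Toyotas are there? Read the count column of the Toyota row.
toyota_count = [g[1] for g in grouped if g[0] == "Toyota"][0]

# Which make has the most records? Take the row with the largest count.
most_common_make = max(grouped, key=lambda g: g[1])[0]

# How many 2010 Hyundais? One filter finds the Hyundai row,
# a second filter looks inside that row's list of records.
hyundai_rows = [g for g in grouped if g[0] == "Hyundai"]
hyundai_2010 = [r for r in hyundai_rows[0][2] if r[1] == 2010]

print(toyota_count)       # 3
print(most_common_make)   # Toyota
print(len(hyundai_2010))  # 1
```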

  1. Talk with Your Partner: Determine one question you can answer by grouping your data, and build code to answer it.
    Example questions for which grouping is helpful:
    • How many Toyotas are in the database?
    • Which brand in the table has the most models listed?
    • How many 2010 Hyundais are in the database? (This requires looking inside one of the lists in column C, so you'd need two keep functions.)

Plotting Data

The bar chart function works like the group function, but with special features for numeric data: it allows you to select upper and lower limits of the data; you can have a range of values in one bucket, such as values 6–10, values 11–15, and so on; and it sorts the groups. For example, here is the cars data grouped by city MPG (column 9):

The number in column A is the largest value included in each group. If the values aren't all integers, the next group includes anything larger. For example, the group numbered 15 includes values from 10.0001 (or anything more than 10) to exactly 15.
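
The bucketing rule can be sketched in Python as follows. This is a guess at the behavior based only on the description above, not the bar chart block's actual implementation: each value is labeled with the smallest multiple of the interval that is greater than or equal to it, values outside the limits are dropped, and the groups are sorted by label. The function name `bar_chart` and the sample values are made up for illustration.

```python
import math

def bar_chart(values, low, high, interval):
    """Count values from low to high in buckets labeled by their largest member."""
    counts = {}
    for v in values:
        if low <= v <= high:
            # Smallest multiple of interval that is >= v,
            # e.g. 10.0001 -> 15 and 15 -> 15 when interval is 5.
            label = math.ceil(v / interval) * interval
            counts[label] = counts.get(label, 0) + 1
    return sorted(counts.items())

# Made-up MPG values; 3 and 60 fall outside the limits and are dropped.
values = [3, 10, 10.0001, 12, 15, 16, 21, 60]
buckets = bar_chart(values, 5, 50, 5)

for label, count in buckets:
    print(label, "#" * count)  # a plain-text bar for each bucket
```

With these values, the bucket labeled 15 holds 10.0001, 12, and 15, while 10 itself lands in the bucket labeled 10.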

You can plot the data from bar chart to visualize them:

  1. Plot a few bar charts of some fields from your dataset and make at least one new observation about your data.
  2. The mode of a data set is the value that appears most often in it.
    1. Here is a bar chart of field 11 from the cars data set (highway MPG) with MPG values from 5 to 50, using an interval of 3. Identify the mode. It will be a range of values such as 13–15 or 16–18.
    2. Here is another bar chart with all the inputs the same as before, but with an interval of 6. Identify the mode.
    3. Talk with Your Partner: How can these results both be correct? (There's nothing wrong with the graphs.)
    4. Talk with Your Partner: Why would you ever use an interval larger than 1? If you have time, research this question.
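
One way to explore the interval questions is to bucket the same values twice and compare the fullest bucket each time. The Python sketch below does that with made-up whole-number MPG values (not the cars data). The bucketing rule, in which each value is labeled with the smallest multiple of the interval at or above it, is an assumption inferred from the ranges named above, and `modal_range` is a hypothetical helper.

```python
import math
from collections import Counter

def modal_range(values, interval):
    """Report (lowest, highest) whole-number values of the fullest bucket."""
    labels = Counter(math.ceil(v / interval) * interval for v in values)
    top_label, _ = labels.most_common(1)[0]
    return (top_label - interval + 1, top_label)

# Made-up whole-number MPG values.
mpg = [14, 15, 16, 16, 17, 17, 18, 19, 20, 20, 21, 22, 23, 23, 24]

print(modal_range(mpg, 3))  # (16, 18)
print(modal_range(mpg, 6))  # (19, 24)
```

In this made-up data, the fullest bucket at interval 3 is 16–18, yet at interval 6 the fullest bucket is 19–24, which does not even contain 16–18. The data did not change; only the bucket boundaries did.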