📓2.1: Pandas

What is Data Science?

Data Science
An emerging, interdisciplinary field that brings together ideas about processes and systems, some of which have been around for years or even centuries, to extract knowledge from data in various forms and to make predictions from it.

In 2016 a study reported that 90% of the data in the world today has been created in the last two years alone. This is the result of the continuing acceleration of the rate at which we store data. Some estimates indicate that roughly 2.5 quintillion bytes of data are generated per day; that’s 2,500,000,000,000,000,000 bytes!

By comparison, all the data in the Library of Congress adds up to about 200 TB, merely 200,000,000,000,000 bytes. This means that we are capturing 12,500 Libraries of Congress per day!
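
The comparison above is simple division. A quick sketch of the arithmetic, using only the estimates quoted in the text (the variable names are just for illustration):

```python
# Estimates quoted in the text above (not exact measurements)
bytes_per_day = 2_500_000_000_000_000_000    # roughly 2.5 quintillion bytes per day
library_of_congress = 200_000_000_000_000    # about 200 TB, written out in bytes

# How many Libraries of Congress fit into one day's worth of data?
libraries_per_day = bytes_per_day // library_of_congress
print(f"{libraries_per_day:,} Libraries of Congress per day")
```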

The amount of data that Google alone stores in its servers is estimated to be 15 exabytes (15 followed by 18 zeros!). You can visualize 15 exabytes as a pile of cards three miles high, covering all of New England.

Everywhere you go, someone or something is collecting data about you: what you buy, what you read, where you eat, where you stay, how and when you travel, and so much more. The entire digital universe was expected to reach 44 zettabytes by 2020, which would mean 40 times more bytes than there are stars in the observable universe! By 2025, it is estimated that 463 exabytes of data will be created each day globally.

Often, this data is collected and stored with little idea about how to use it, because technology makes it so easy to capture. Other times, the data is collected quite intentionally. The big question is: what does it all mean? That’s where data science comes in.

What does a data scientist do?

As an interdisciplinary field of inquiry, data science is perfect for a liberal arts college as well as many other types of universities. Combining statistics, computer science, writing, art, and ethics, data science has applications across the entire curriculum: biology, economics, management, English, history, music, pretty much everything. The best thing about data science is that the job of a data scientist seems perfectly suited to many liberal arts students.

“The best data scientists have one thing in common: unbelievable curiosity.” - D.J. Patil, Chief Data Scientist of the United States from 2015 to 2017.

image

The diagram above is widely used to answer the question “What is Data Science?” It also is a great illustration of the liberal arts nature of data science. Some computer science, some statistics, and something from one of the many majors available at a liberal arts college, all of which are looking for people with data skills!


🐼 Data Manipulation with Pandas

Using a GitHub Template for class notes

  1. Go to the public template repository for our class: BWL-CS Python Template
  2. Click the button above the list of files then select Create a new repository
  3. Specify the repository name: CS3-Unit-2-Notes
  4. Click the button to create the repository

    Now you have your own personal copy of this starter code that you can always access under the Your repositories section of GitHub!

  5. Now on your repository, click the Code button and select the Codespaces tab
  6. Click Create Codespace on main and wait for the environment to load, then you’re ready to code!
  7. 📝 Take notes in this Codespace during class, coding along with the instructor.

From NumPy to Pandas

NumPy is a Python library centered around the ndarray object, which enables efficient storage and manipulation of dense typed arrays. Pandas is a newer package built on top of NumPy that provides an efficient implementation of a DataFrame, which is the main data structure offered by Pandas.

NumPy’s limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us.

DataFrame
A multidimensional array object with attached row and column labels, often containing heterogeneous types and/or missing data. The concept is similar to a spreadsheet in Excel or Google Sheets, but more versatile.

As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs. Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of “data munging” tasks that occupy much of a data scientist’s time.

As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are. Thus, before we go any further, let’s take a look at these three fundamental Pandas data structures: the Series, DataFrame, and Index.

We will start our code sessions with the standard NumPy and Pandas imports, under the aliases np and pd:

import numpy as np
import pandas as pd

Pandas Objects

The Series Object

Series
A one-dimensional array object of indexed data.

A Series object can be created from a list or array as follows:

data = pd.Series([0.25, 0.5, 0.75, 1.0])

The Series combines a sequence of values with an explicit sequence of indices, which we can access with the values and index attributes. The values are simply a NumPy array:

data.values

The index is an array-like object of type pd.Index, which we’ll discuss in more detail momentarily:

data.index
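
To see what these two attributes hold, here is a small self-contained sketch. The names values and index are just local variables chosen for this example:

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

values = data.values   # the underlying data as a NumPy array
index = data.index     # a pd.Index object (a RangeIndex when no index is given)

print(type(values))    # the NumPy array type
print(values)
print(index)
```

With no index supplied, Pandas falls back to the integer positions 0 through 3, which is why the Series looks so much like a plain array at this point.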

As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

data[1]
data[1:3]

As we will see, though, the Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates.

Series as Generalized NumPy Array

From what we’ve seen so far, the Series object may appear to be basically interchangeable with a one-dimensional NumPy array. The essential difference is that while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type. So, if we wish, we can use strings as an index:

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data)

And the item access works as expected:

print(data['b'])

We can even use noncontiguous or nonsequential indices:

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
print(data)
print(data[5])

Series as Specialized Dictionary

In this way, you can think of a Pandas Series a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it more efficient than Python dictionaries for certain operations.
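
One way to see the difference typing makes is to double every value in each structure. This is only an illustrative sketch (plain_dict, series, and the doubled results are names made up for this example):

```python
import pandas as pd

plain_dict = {'a': 0.25, 'b': 0.5, 'c': 0.75}
series = pd.Series(plain_dict)

# Every value in the Series shares one data type
print(series.dtype)

# A Series supports whole-collection arithmetic in a single expression...
doubled_series = series * 2

# ...while a dictionary needs an explicit loop or comprehension
doubled_dict = {key: value * 2 for key, value in plain_dict.items()}

print(doubled_series)
print(doubled_dict)
```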

The Series-as-dictionary analogy can be made even more clear by constructing a Series object directly from a Python dictionary, here the five most populous US states according to the 2020 census:

population_dict = {'California': 39538223, 'Texas': 29145505,
                   'Florida': 21538187, 'New York': 20201249,
                   'Pennsylvania': 13002700}
population = pd.Series(population_dict)

From here, typical dictionary-style item access can be performed:

print(population['California'])

Unlike a dictionary, though, the Series also supports array-style operations such as slicing:

print(population['California':'Florida'])
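
One detail worth noticing is that a slice by label includes its end point, while a slice by position does not. A short sketch that rebuilds the same population Series so it can run on its own (by_label and by_position are illustrative names):

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505,
                        'Florida': 21538187, 'New York': 20201249,
                        'Pennsylvania': 13002700})

# Slicing by label: the final label ('Florida') IS included
by_label = population['California':'Florida']

# Slicing by position: the final position (2) is NOT included
by_position = population.iloc[0:2]

print(by_label)
print(by_position)
```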

Constructing Series Objects

We’ve already seen a few ways of constructing a Pandas Series from scratch. All of them are some version of the following:

pd.Series(data, index=index)

where index is an optional argument, and data can be one of many entities.

For example, data can be a list or NumPy array, in which case index defaults to an integer sequence:

pd.Series([2, 4, 6])

Or data can be a scalar, which is repeated to fill the specified index:

pd.Series(5, index=[100, 200, 300])

Or it can be a dictionary, in which case index defaults to the dictionary keys:

pd.Series({2:'a', 1:'b', 3:'c'})

In each case, the index can be explicitly set to control the order or the subset of keys used:

pd.Series({2:'a', 1:'b', 3:'c'}, index=[1, 2])
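
Putting the four construction styles side by side makes the resulting indexes easier to compare. This sketch reuses the same inputs as above; the variable names are only for illustration:

```python
import pandas as pd

from_list = pd.Series([2, 4, 6])                        # index defaults to 0, 1, 2
from_scalar = pd.Series(5, index=[100, 200, 300])       # 5 repeated for each label
from_dict = pd.Series({2: 'a', 1: 'b', 3: 'c'})         # index taken from the keys
subset = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[1, 2])  # only keys 1 and 2, in that order

for series in (from_list, from_scalar, from_dict, subset):
    print(list(series.index), series.tolist())
```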

The Pandas DataFrame Object

The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. Since we didn’t spend time on NumPy arrays, we’ll focus on the concept of a DataFrame as a specialized dictionary.

DataFrame as Specialized Dictionary

Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. For example, given a DataFrame named states that has an 'area' column, asking for the 'area' item returns the Series object containing those areas:

print(states['area'])
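
The states DataFrame is not built anywhere in these notes, so here is one way it might be set up. This is a sketch under assumptions: the area Series and its values (land areas in square kilometers) are illustrative figures added for this example, not data from the notes.

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505,
                        'Florida': 21538187, 'New York': 20201249,
                        'Pennsylvania': 13002700})

# Assumed, illustrative areas in square kilometers
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'Florida': 170312, 'New York': 141297,
                  'Pennsylvania': 119280})

# Each dictionary key becomes a column name; each Series becomes that column
states = pd.DataFrame({'population': population, 'area': area})

print(states)
print(states['area'])   # dictionary-style access returns one column as a Series
```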

Constructing DataFrame Objects

A Pandas DataFrame can be constructed in a variety of ways. Here we’ll explore several examples.

From a single Series object

A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:

pd.DataFrame(population, columns=['population'])

From a list of dicts

Any list of dictionaries can be made into a DataFrame. Even if some keys are missing from some of the dictionaries, Pandas will fill in the gaps with NaN values (i.e., “Not a Number”).

We’ll use a simple list comprehension to create some data:

data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)
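
The example above has every key in every dictionary, so no NaN values appear. A small sketch of the missing-key case, using made-up records:

```python
import pandas as pd

# The first record has no 'c' key and the second has no 'a' key
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
frame = pd.DataFrame(records)

print(frame)

# pd.isna() reports which cells were filled in with NaN
print(pd.isna(frame))
```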

From a dictionary of Series objects

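A DataFrame can be constructed from a dictionary of Series objects as well. Here population is the Series built earlier, and area is assumed to be a second Series indexed by the same state names: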

pd.DataFrame({'population': population,
              'area': area})

Data Indexing and Selection

Data Selection in Series

As you saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.
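
A brief sketch of both behaviors on one small Series. The variable names here (has_a, keys, masked, picked) are chosen only for this example:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Dictionary-like behavior
has_a = 'a' in data          # membership checks the index labels
keys = list(data.keys())     # the index labels, like dictionary keys
data['e'] = 1.25             # assigning to a new label extends the Series

# Array-like behavior
masked = data[(data > 0.3) & (data < 0.8)]   # keep values matching a condition
picked = data[['a', 'e']]                    # select several labels at once

print(has_a, keys)
print(masked)
print(picked)
```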

Indexers: loc and iloc

If your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit indices:

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# explicit index when indexing
data[1]

# implicit index when slicing
data[1:3]

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.

First, the loc attribute allows indexing and slicing that always references the explicit index:

data.loc[1]

data.loc[1:3]

The iloc attribute allows indexing and slicing that always references the implicit Python-style index:

data.iloc[1]

data.iloc[1:3]
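
Running the four expressions side by side shows how differently the same numbers are interpreted. A self-contained sketch using the same Series as above:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = data.loc[1]            # the value whose label is 1
label_slice = data.loc[1:3]       # labels 1 through 3, end label included
by_position = data.iloc[1]        # the value at position 1 (the second value)
position_slice = data.iloc[1:3]   # positions 1 and 2, end position excluded

print(by_label, label_slice.tolist())
print(by_position, position_slice.tolist())
```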

One guiding principle of Python code is that “explicit is better than implicit.” The explicit nature of loc and iloc makes them helpful in maintaining clean and readable code; especially in the case of integer indexes, using them consistently can prevent subtle bugs due to the mixed indexing/slicing convention.

Selecting Columns & Rows in DataFrames

  1. Download this Pokemon Dataset CSV file to use while we learn Pandas operations.
  2. Upload it to your CS3-Unit-2-Notes repository.
  3. Load data from the CSV file into a DataFrame:
    pokemon = pd.read_csv('pokemon_data.csv')
    
  4. Check out the DataFrame:
    print(pokemon)
    print(pokemon.columns)
    

Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.

DataFrame as Dictionary

The first analogy we will consider is the DataFrame as a dictionary of related Series objects. The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

pokemon['Type 1']

Equivalently, we can use attribute-style access with simple string column names:

pokemon.HP

Though this is a useful shorthand, keep in mind that it does not work for all cases! For example: if the column names include whitespace (like 'Type 1'), are not strings, or if the column names conflict with methods of the DataFrame, this attribute-style access is not possible.
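
A tiny sketch of where the shorthand works and where it does not. The frame below is made up for illustration, including a hypothetical column named size that collides with an existing DataFrame attribute:

```python
import pandas as pd

# Hypothetical sample rows, not taken from the course dataset
frame = pd.DataFrame({'HP': [45, 60],
                      'Type 1': ['Grass', 'Grass'],
                      'size': [1, 2]})

# Works: 'HP' is a simple string that is not already a DataFrame attribute
same_column = frame.HP.equals(frame['HP'])

# Does not give the column: frame.size is the built-in element count (rows x columns)
element_count = frame.size

# Columns with spaces or conflicting names need bracket notation
type_column = frame['Type 1']
size_column = frame['size']

print(same_column, element_count)
print(type_column.tolist(), size_column.tolist())
```

Bracket notation works in every case, so it is the safer habit when in doubt.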

As with a Series, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:

# Compute ratio of Attack stat to Special Attack stat
pokemon['Attack Ratio'] = pokemon['Attack'] / pokemon['Sp. Atk']

DataFrame as Two-Dimensional Array

As mentioned previously, we can also view the DataFrame as an enhanced two-dimensional array.

We can examine the raw underlying data array using the values attribute:

pokemon.values

When it comes to indexing of a DataFrame object, Pandas again uses the loc and iloc indexers mentioned earlier.

  • Use .iloc when you want to access data by position.
  • Use .loc when you want to access data by label (e.g., Pokémon names).

Using the iloc indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:

# Read a specific location [R, C]
print(pokemon.iloc[100,1])

# Read several rows
print(pokemon.iloc[25:30])

# Read every row for a certain column (row-by-row iteration; pokemon['Name'] selects the whole column at once)
for index, row in pokemon.iterrows():
    print(index, row['Name'])
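
If you do not have the CSV file handy, the same iloc patterns can be tried on a small stand-in. The sample DataFrame below is hypothetical data invented for this sketch, with far fewer rows and columns than the real file:

```python
import pandas as pd

# Hypothetical sample rows standing in for pokemon_data.csv
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Squirtle', 'Pikachu'],
    'Type 1': ['Grass', 'Fire', 'Water', 'Electric'],
    'HP': [45, 39, 44, 35],
})

cell = sample.iloc[2, 0]      # row position 2, column position 0
rows = sample.iloc[1:3]       # row positions 1 and 2 (3 is excluded)
column = sample.iloc[:, 0]    # every row of the first column, no loop needed

print(cell)
print(rows)
print(column.tolist())
```

Note that rows still carries its original index labels (1 and 2) and column names, as described above.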

Similarly, using the loc indexer we can index the underlying data in an array-like style but using the explicit index and column names:

grass_types = pokemon.loc[pokemon['Type 1'] == "Grass"]
print(grass_types)
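
The loc indexer also accepts a second argument that picks columns by name. A sketch on hypothetical sample rows (invented for this example, not read from the CSV file):

```python
import pandas as pd

# Hypothetical sample rows standing in for pokemon_data.csv
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Oddish', 'Pikachu'],
    'Type 1': ['Grass', 'Fire', 'Grass', 'Electric'],
    'HP': [45, 39, 45, 35],
})

# Rows where the condition is True, all columns
grass = sample.loc[sample['Type 1'] == 'Grass']

# Same rows, but only the named columns
grass_hp = sample.loc[sample['Type 1'] == 'Grass', ['Name', 'HP']]

# Row labels 0 through 2 (end label included), one column
first_names = sample.loc[0:2, 'Name']

print(grass)
print(grass_hp)
print(first_names.tolist())
```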

Using String Indices

If you modify your DataFrame to use string indices, such as the Pokémon names, you will likely use .loc more frequently than .iloc. You first need to set the Pokémon names as the index:

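# set_index returns a new DataFrame; do not pass inplace=True here,
# because an in-place call returns None and poke would be empty
poke = pokemon.set_index('Name')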
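
One thing to watch with set_index: by default it returns a new DataFrame and leaves the original unchanged, while passing inplace=True changes the original and returns None, so the result of an in-place call should not be assigned to a variable. A sketch on hypothetical sample rows:

```python
import pandas as pd

# Hypothetical sample rows standing in for pokemon_data.csv
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Squirtle', 'Pikachu'],
    'Type 1': ['Grass', 'Fire', 'Water', 'Electric'],
})

# Default: a new DataFrame comes back and sample itself is untouched
indexed = sample.set_index('Name')

# inplace=True: the copy is modified directly and the call returns None
scratch = sample.copy()
result = scratch.set_index('Name', inplace=True)

print(indexed.loc['Pikachu', 'Type 1'])
print(result)                 # None
print(list(scratch.index))    # the copy now uses the names as its index
```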

Accessing a specific row and column:

# Example accessing Pikachu's type by name
print(poke.loc['Pikachu', 'Type 1'])

To read multiple rows using .loc, you need to specify a list of Pokémon names:

# Example accessing several Pokémon by a list of their names
print(poke.loc[['Squirtle', 'Bulbasaur', 'Charmander']])
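
A list and a slice behave differently with string labels: a list returns exactly the named rows in the order you wrote them, while a slice returns everything between two labels in the order the rows are stored. A sketch using a hypothetical poke_sample DataFrame invented for this example:

```python
import pandas as pd

# Hypothetical sample rows, indexed by name
poke_sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Squirtle', 'Pikachu'],
    'Type 1': ['Grass', 'Fire', 'Water', 'Electric'],
}).set_index('Name')

# List of labels: rows come back in the order listed
listed = poke_sample.loc[['Squirtle', 'Bulbasaur']]

# Slice of labels: every stored row from the first label through the last, inclusive
sliced = poke_sample.loc['Bulbasaur':'Squirtle']

print(listed)
print(sliced)
```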

When iterating with .iterrows(), it automatically provides the index (now Pokémon names) along with the row:

# Iterate through each row, printing Name - Type
for index, row in poke.iterrows():
    print(index, " - ", row['Type 1'])

⭐️ Glossary

Definitions

Data Frame
A data frame is a two-dimensional, labeled table of data. Data frames are often built from part of a larger dataset, so that specific data operations can run without needing the entire dataset. (In pandas it is called DataFrame.)
Explicit Index
Uses the values (numeric or non-numeric) set as the index. For example, if we set a column or row as the index then we can use values in the row or column as indices in different pandas methods.
Implicit Index
Uses the location (numeric) of the indices, similar to Python-style indexing.
Index
An Index is a value that represents a position (address) in the DataFrame or Series.
Series
A Series is a one-dimensional array of related data values, each paired with an index label.

Keywords

  • import: Import lets programmers use packages, libraries or modules that have already been programmed.

  • <DataFrame>[<string>]: Returns the Series corresponding to the given column (<string>).

  • <DataFrame>[<list of strings>]: Returns the given set of columns as a DataFrame.

  • <DataFrame>[<series/list of Boolean>]: Returns the rows of the DataFrame at the positions where the given list is True.

  • <DataFrame>.loc[ ]: Uses explicit indexing to return a DataFrame containing those indices and the values associated with them.

  • <DataFrame>.loc[<string1>:<string2>]: This takes in a range of explicit indices and returns a DataFrame containing those indices and the values associated with them.

  • <DataFrame>.loc[<string>]: Uses an explicit index and returns the row(s) for that index value.

  • <DataFrame>.loc[<list/series of strings>]: Returns a new DataFrame containing the labels given in the list of strings.

  • <DataFrame>.iloc[ ]: Uses implicit indexing to return a DataFrame containing those indices and the values associated with them.

  • <DataFrame>.iloc[<index, range of indices>]: This takes in an implicit index (or a range of implicit indices) and returns a DataFrame containing those indices and the values associated with them.

  • <DataFrame>.set_index(<string>): Sets the existing column(s) with the given name as the index of the DataFrame.

  • <DataFrame>.head(<numeric>): Returns the first row(s). If no parameter (<numeric>) is set, it returns the first five rows.

  • <pandas>.DataFrame(<data>): Used to create a DataFrame with the given data.

  • <pandas>.read_csv(): Used to read a csv file into a DataFrame.

  • <DataFrame>.set_index(<column>): Gets the values of the given column and sets them as indices. The rows keep their existing order; use sort_index() to sort them by the new indices.

  • <pandas>.to_numeric(): Converts what is inside the parentheses into numeric values.

  • <series>.str.startswith(<string>): Checks whether each string in the Series starts with the given parameter (<string>), and returns a Series of Boolean values (True or False).

  • <DataFrame>.sort_index(): Sorts the rows of the DataFrame by their index labels, in ascending order by default.
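
Several of the keywords above can be tried together on a few rows. This is a sketch on hypothetical data invented for the example (the HP values start out as text so that to_numeric has something to convert):

```python
import pandas as pd

# Hypothetical sample rows; HP is stored as text on purpose
sample = pd.DataFrame({
    'Name': ['Squirtle', 'Bulbasaur', 'Charmander', 'Pikachu'],
    'HP': ['44', '45', '39', '35'],
})

sample['HP'] = pd.to_numeric(sample['HP'])            # text -> numbers
first_two = sample.head(2)                            # first two rows
starts_with_c = sample['Name'].str.startswith('C')    # one True/False per row
c_rows = sample[starts_with_c]                        # Boolean selection of rows

by_name = sample.set_index('Name')        # names become the index, row order unchanged
sorted_by_name = by_name.sort_index()     # rows sorted by index label, ascending

print(first_two)
print(starts_with_c.tolist())
print(c_rows)
print(list(by_name.index))
print(list(sorted_by_name.index))
```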


🛑 When class ends, don’t forget to SAVE YOUR WORK! There are multiple steps to saving in GitHub:

  1. Navigate to the Source Control menu on the LEFT sidebar
  2. Click the button on the LEFT menu
  3. Type a brief commit message at the top of the file that opens, for example: updated main.py
  4. Click the small ✔️ checkmark in the TOP RIGHT corner
  5. Click the button on the LEFT menu
  6. Finally you can close your Codespace!

Acknowledgement

Content on this page is adapted from How to Think Like a Data Scientist on Runestone Academy (Brad Miller, Jacqueline Boggs, and Jan Pearce) and Python Data Science Handbook (Jake VanderPlas).