The Pandas Python library is an extremely powerful tool for data analysis, data manipulation, and plotting. However, that power (and the complexity that comes with it) can be overwhelming, given the myriad functions, methods, and capabilities the library provides.
In this brief tutorial we’ll explore the basic use of the DataFrame in Pandas, which is the basic data structure for the entire system, and how to make use of the index and column labels to keep track of the data within the DataFrame.
Creating a Basic DataFrame
For this tutorial, we need something to work with, so we’ll create a very simple DataFrame consisting of three book titles and author names:
import pandas as pd

pd.DataFrame(
    [
        (
            'The Hobbit',
            'J.R.R. Tolkien'
        ),
        (
            'Robinson Crusoe',
            'Daniel Defoe'
        ),
        (
            'Moby-Dick',
            'Herman Melville'
        )
    ]
)
Note: Throughout the tutorial the examples will include a great deal of excess spacing. This spacing is not required, but serves to better illustrate the syntax we’re using.
The result of the above DataFrame creation is a simple 3-row, 2-column table with automatically generated numeric index and column labels:
| | 0 | 1 |
|---|---|---|
| 0 | The Hobbit | J.R.R. Tolkien |
| 1 | Robinson Crusoe | Daniel Defoe |
| 2 | Moby-Dick | Herman Melville |
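The automatically generated labels can be inspected directly. The following is a minimal sketch, assuming the DataFrame is bound to a variable named df (the example above does not assign it to anything) and that pandas is imported as pd:

```python
import pandas as pd

# Same data as above, written compactly and bound to a variable.
df = pd.DataFrame(
    [
        ('The Hobbit', 'J.R.R. Tolkien'),
        ('Robinson Crusoe', 'Daniel Defoe'),
        ('Moby-Dick', 'Herman Melville'),
    ]
)

# With no labels supplied, both axes are numbered from zero.
print(list(df.columns))  # [0, 1]
print(list(df.index))    # [0, 1, 2]
print(df.shape)          # (3, 2)
```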
Adding Columns and Indices
When initially creating a DataFrame, it is entirely possible to specify the column and index labels. To do so, we’ll need to specify values for the data, index, and columns parameters:
pd.DataFrame(
    data=[
        (
            'The Hobbit',
            'J.R.R. Tolkien'
        ),
        (
            'Robinson Crusoe',
            'Daniel Defoe'
        ),
        (
            'Moby-Dick',
            'Herman Melville'
        )
    ],
    columns=[
        'title',
        'author'
    ],
    index=[
        'first',
        'second',
        'third'
    ]
)
| | title | author |
|---|---|---|
| first | The Hobbit | J.R.R. Tolkien |
| second | Robinson Crusoe | Daniel Defoe |
| third | Moby-Dick | Herman Melville |
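Once labels are in place, values can be looked up by name rather than by position. The sketch below is one way to do that, assuming the labeled DataFrame is bound to a variable named df (a name introduced here for illustration) and using the loc accessor for label-based selection:

```python
import pandas as pd

df = pd.DataFrame(
    data=[
        ('The Hobbit', 'J.R.R. Tolkien'),
        ('Robinson Crusoe', 'Daniel Defoe'),
        ('Moby-Dick', 'Herman Melville'),
    ],
    columns=['title', 'author'],
    index=['first', 'second', 'third'],
)

# Select a single cell by its row label and column label.
author = df.loc['second', 'author']
print(author)  # Daniel Defoe

# Select a whole column by its label; the row labels are preserved.
titles = df['title']
print(titles['third'])  # Moby-Dick
```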
Now our data structure has index and column labels that make a bit more sense. However, what happens when we have an existing DataFrame and we want to update the column labels on the fly?
Modifying Column Labels
There are two ways to alter the column labels: assigning to the columns attribute and calling the rename method.
Using the Columns Attribute
If we have our DataFrame already created, the simplest way to overwrite the column labels is to assign a new list of names to the columns attribute of the DataFrame object. The list must contain exactly one name for each existing column.
For example, if we take our original DataFrame:
df = pd.DataFrame(
    [
        (
            'The Hobbit',
            'J.R.R. Tolkien'
        ),
        (
            'Robinson Crusoe',
            'Daniel Defoe'
        ),
        (
            'Moby-Dick',
            'Herman Melville'
        )
    ]
)
df
| | 0 | 1 |
|---|---|---|
| 0 | The Hobbit | J.R.R. Tolkien |
| 1 | Robinson Crusoe | Daniel Defoe |
| 2 | Moby-Dick | Herman Melville |
We can modify the column labels by adding the following assignment:
df.columns = [
    'title',
    'author'
]
df
| | title | author |
|---|---|---|
| 0 | The Hobbit | J.R.R. Tolkien |
| 1 | Robinson Crusoe | Daniel Defoe |
| 2 | Moby-Dick | Herman Melville |
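One caveat with this approach is that the new list has to match the number of columns. The following minimal sketch, assuming pandas is imported as pd, shows what happens when too few names are supplied; the variable name error_message is only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ('The Hobbit', 'J.R.R. Tolkien'),
        ('Robinson Crusoe', 'Daniel Defoe'),
        ('Moby-Dick', 'Herman Melville'),
    ]
)

error_message = None
try:
    # Two columns exist, but only one name is given.
    df.columns = ['title']
except ValueError as exc:
    # pandas rejects the assignment with a length-mismatch error.
    error_message = str(exc)

print(error_message)
```

If only some of the labels need to change, the rename method is usually the more convenient choice, since it does not require every column to be listed.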
Using the Rename Method
The other technique for renaming column labels is to call the rename method on the DataFrame object, passing a dictionary that maps each existing label to its new label as the columns parameter:
df = pd.DataFrame(
    [
        (
            'The Hobbit',
            'J.R.R. Tolkien'
        ),
        (
            'Robinson Crusoe',
            'Daniel Defoe'
        ),
        (
            'Moby-Dick',
            'Herman Melville'
        )
    ]
)
df.rename(
    columns={
        0 : 'title',
        1 : 'author'
    },
    inplace=True
)
df
| | title | author |
|---|---|---|
| 0 | The Hobbit | J.R.R. Tolkien |
| 1 | Robinson Crusoe | Daniel Defoe |
| 2 | Moby-Dick | Herman Melville |
It’s important to note that since the rename method renames existing labels, each dictionary entry must give the existing label first, followed by the new label, as shown in the example above.
Also, we specify True for the inplace parameter here because we want to update the existing DataFrame, rather than have the call return a newly created DataFrame.
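To illustrate the difference, here is a minimal sketch of the same call without inplace, assuming pandas is imported as pd; the variable names renamed and relabeled are only illustrative. In this form rename leaves the original DataFrame untouched and returns a new one. The sketch also shows that rename accepts an index mapping for the row labels:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ('The Hobbit', 'J.R.R. Tolkien'),
        ('Robinson Crusoe', 'Daniel Defoe'),
        ('Moby-Dick', 'Herman Melville'),
    ]
)

# Without inplace=True, rename returns a new DataFrame.
renamed = df.rename(columns={0: 'title', 1: 'author'})

print(list(df.columns))       # [0, 1]  (original is unchanged)
print(list(renamed.columns))  # ['title', 'author']

# Row labels can be renamed the same way through the index parameter.
relabeled = renamed.rename(index={0: 'first', 1: 'second', 2: 'third'})
print(list(relabeled.index))  # ['first', 'second', 'third']
```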