8 Ways to Create (Initialize) Pandas Data Frames
One of the key tasks for data scientists and computer programmers is to read, write, organize, and manipulate data. Perhaps the most intuitive format for storing data is the tabular format, which organizes data into tables, also commonly referred to as data frames or spreadsheets. Because tabular data are so common, programs like Excel and Google Sheets are must-learn software in many occupations. However, when data get big you need a way to handle all that information programmatically. That’s where pandas comes in.
pandas is a Python module for creating, organizing, and writing tabular data. It has a ton of functionality and is a go-to tool for most data scientists and computer programmers. If you’re new to pandas, there can be a bit of a learning curve to getting started. This tutorial is designed to help you understand the basics of creating pandas data frames so that you can eventually become a data wizard!
Setup
Before you start, make sure you have the pandas Python module installed. You can easily do this with pip or conda.
To read from Excel files with pandas you’ll also need to install the openpyxl module. It is also easily installed with pip or conda.
Now import pandas and we’ll get started!
import pandas as pd
1. Empty Pandas Data Frame
To start, let’s create an empty pandas data frame (df). There will be no data in this data frame; it will just be an empty pandas.DataFrame object. We will add data to it later.
df = pd.DataFrame()
type(df)
pandas.core.frame.DataFrame
As you can see, df is a pandas.DataFrame object, and at this point it holds no data. Creating an empty data frame is something that you might need to do regularly. For example, you might be creating data frames as data become available, and the types and shapes of those data may vary.
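A quick way to confirm that a data frame really is empty is to inspect its empty and shape attributes. The snippet below is a minimal sketch; the variable names are placeholders chosen for this illustration.

```python
import pandas as pd

# Create a data frame with no rows and no columns
empty_df = pd.DataFrame()

# .empty is True when the frame holds no data
is_empty = empty_df.empty

# .shape reports (number of rows, number of columns)
shape = empty_df.shape

print(is_empty, shape)
```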
Add Data to an Empty Data Frame
An empty data frame isn’t useful unless we add data to it. That’s what we’ll do here. Once we have a data frame we can easily create columns and assign data values to the columns.
Below we create two lists: one contains the names of people and the other contains their ages. These data are assigned to columns by referencing the column we want to create and assigning the list with the associated data.
name = ['Jill', 'Jesse', 'Jane', 'Jacob']
age = [34, 67, 78, 45]
df['name'] = name
df['age'] = age
df
| | name | age |
---|---|---|
0 | Jill | 34 |
1 | Jesse | 67 |
2 | Jane | 78 |
3 | Jacob | 45 |
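One thing to keep in mind with this approach is that every list assigned as a column needs to have the same length as the data frame. The sketch below illustrates what happens, in the pandas versions I am familiar with, when the lengths do not match; the shortened age list is a made-up example.

```python
import pandas as pd

df = pd.DataFrame()
df['name'] = ['Jill', 'Jesse', 'Jane', 'Jacob']  # the frame now has 4 rows

try:
    # Only 3 values for a 4-row frame, so pandas rejects the assignment
    df['age'] = [34, 67, 78]
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

Because the assignment fails, the data frame keeps only the name column.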
2. Data Frame from a List of Values
Instead of creating an empty data frame and then assigning lists to columns, we can create a pandas data frame directly from a list. Below is a list of integer data. We create a data frame from this list by passing the list to pandas.DataFrame and specifying a column name. Notice that the column name is specified as a list.
data = [1, 2, 3, 4]
df = pd.DataFrame(data, columns=['ColumnName'])
df
| | ColumnName |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
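If the columns argument is left out, pandas still builds the data frame but labels the single column with the integer 0. The short sketch below contrasts the two calls; the variable names are placeholders for this illustration.

```python
import pandas as pd

data = [1, 2, 3, 4]

# Without column names, pandas falls back to an integer label
df_default = pd.DataFrame(data)
default_columns = list(df_default.columns)

# With an explicit name, passed as a one-element list
df_named = pd.DataFrame(data, columns=['ColumnName'])
named_columns = list(df_named.columns)

print(default_columns, named_columns)
```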
This is a very simple example. Let’s examine how we can generate a data frame from a more realistic list.
3. Data Frame from a List of Lists
Going back to our initial example data that contain names and ages, we can represent these data as a list of lists. The outer list contains a series of sublists, and each sublist contains the name and age of one person, so each sublist becomes a row. A data frame is initialized from these data by passing the data variable to pandas.DataFrame along with the column names.
data = [['Jill', 34], ['Jesse', 67], ['Jane', 78], ['Jacob', 45]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
df
The resulting data frame contains the same data as the one we created by adding lists to an empty data frame, although the column names are capitalized here. This method is useful if you have previously generated data that you wish to convert into a data frame.
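If your data start out as separate, parallel lists (like the name and age lists from the first example), one way to get the row-wise layout is Python’s built-in zip. This is a sketch of one possible approach, not the only one; the variable names are placeholders.

```python
import pandas as pd

names = ['Jill', 'Jesse', 'Jane', 'Jacob']
ages = [34, 67, 78, 45]

# zip pairs the i-th name with the i-th age, giving one tuple per row
rows = list(zip(names, ages))

# pandas accepts a list of tuples the same way it accepts a list of lists
df_rows = pd.DataFrame(rows, columns=['Name', 'Age'])

print(df_rows)
```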
4. Data Frame from a NumPy Array
When working with numeric data in Python, it is very common to have data represented as NumPy arrays. Pandas data frames can easily be initialized from one- and two-dimensional NumPy arrays.
A 1D NumPy array will be treated like a column of data values by pandas.
A 2D NumPy array will be treated as columns and rows.
As with the other data frame initialization examples, simply pass the NumPy array to pandas.DataFrame and specify the column names in a list. Note that the number of column names must match the number of columns in the input array.
import numpy as np
data = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
df = pd.DataFrame(data, columns=['Column1', 'Column2', 'Column3'])
df
| | Column1 | Column2 | Column3 |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
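The example above covers the 2D case. The sketch below illustrates the other two points from this section: a 1D array becoming a single column, and the error raised when the number of column names does not match the array. The array contents and variable names are made up for this illustration.

```python
import numpy as np
import pandas as pd

# A 1D array is treated as a single column of values
values = np.array([10, 20, 30])
df_1d = pd.DataFrame(values, columns=['Value'])

# A 2 x 3 array needs three column names; supplying two raises an error
grid = np.array([[1, 2, 3], [4, 5, 6]])
try:
    pd.DataFrame(grid, columns=['Column1', 'Column2'])
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)

print(df_1d.shape)
print(mismatch_error)
```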
5. Data Frame from Dictionary
One of the most common ways to initialize a pandas data frame is with a dictionary. A dictionary links a key to a value. Values can be individual strings or numbers, or lists. When you initialize a pandas data frame from a dictionary, the keys become the column names and the values associated with each key become the data in that column.
The code below creates the initial data frame using a dictionary. Notice that when you create a data frame from a dictionary you do not need to specify columns because the dictionary keys become the column names. If you pass a list of column names along with a dictionary, pandas uses it to select and order the keys rather than to rename them, so if you want column names that differ from the dictionary keys, rename the columns after the data frame is created.
df = pd.DataFrame({
'name': ['Jill', 'Jesse', 'Jane', 'Jacob'],
'age': [34, 67, 78, 45]})
df
| | name | age |
---|---|---|
0 | Jill | 34 |
1 | Jesse | 67 |
2 | Jane | 78 |
3 | Jacob | 45 |
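To make the role of column names concrete, here is a small sketch of how the columns argument and DataFrame.rename behave with dictionary input in the pandas versions I am familiar with. The 'city' column name is a hypothetical key that is deliberately absent from the dictionary.

```python
import pandas as pd

data = {
    'name': ['Jill', 'Jesse', 'Jane', 'Jacob'],
    'age': [34, 67, 78, 45],
}
df = pd.DataFrame(data)

# rename returns a new data frame with different column names
renamed = df.rename(columns={'name': 'Name', 'age': 'Age'})

# With a dictionary, columns= selects and orders existing keys
reordered = pd.DataFrame(data, columns=['age', 'name'])

# A name that is not a key ('city' is hypothetical) yields an all-NaN column
with_missing = pd.DataFrame(data, columns=['name', 'city'])

print(list(renamed.columns), list(reordered.columns))
print(with_missing)
```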
6. Data Frame from a List of Dictionaries
In many Python applications it is common for data to be represented as a list of dictionaries instead of a single dictionary whose values are lists (which is how the data were organized in the last example). Pandas will also automatically create a data frame from a list of dictionaries. Just pass the list to pandas.DataFrame.
As before, the key names will become the column names. You’ll want to be sure that all the dictionaries in the list contain the same keys; pandas will not raise an error when a key is missing, but it will fill the gap with NaN.
dicts = [
{'name': 'Jill', 'age': 34},
{'name': 'Jesse', 'age': 67},
{'name': 'Jane', 'age': 78},
{'name': 'Jacob', 'age': 45}
]
df = pd.DataFrame(dicts)
df
| | name | age |
---|---|---|
0 | Jill | 34 |
1 | Jesse | 67 |
2 | Jane | 78 |
3 | Jacob | 45 |
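The sketch below illustrates what happens when one dictionary lacks a key that the others have. The records are a made-up variation on the example data, with the age deliberately left out of the second record.

```python
import pandas as pd

records = [
    {'name': 'Jill', 'age': 34},
    {'name': 'Jesse'},  # no 'age' key in this record
]
df_records = pd.DataFrame(records)

# Missing entries show up as NaN; isna flags them
missing_age = df_records['age'].isna().tolist()

print(df_records)
print(missing_age)
```

Checking for missing values this way can help catch inconsistent records before they cause trouble later in an analysis.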
7. Data Frame from a CSV (or other tabular format)
Perhaps the most common way to use pandas is to read tabular data that have already been compiled. There is pandas support for nearly every type of tabular data. For this example, we’ll focus on comma-separated values because it is one of the more common formats.
Before we can read a csv we need to have one to read. Let’s save our previous data frame as a csv, then we’ll read it back in.
To save the data frame as a csv simply use the data frame’s to_csv method (pandas.DataFrame.to_csv) and specify the file name. I also like to set index to False so that the row numbers are not saved.
df.to_csv('../data/mycsv.csv', index=False)
Check to make sure the csv was saved properly (you can open it in any text or code editor or in Excel or another spreadsheet program).
Now let’s read the csv into Python with pandas.read_csv. We just need to specify the file name and pass it to read_csv. Super simple.
fn = '../data/mycsv.csv'
df = pd.read_csv(fn)
df
| | name | age |
---|---|---|
0 | Jill | 34 |
1 | Jesse | 67 |
2 | Jane | 78 |
3 | Jacob | 45 |
That was easy! And, as you can see, the data frame we read from the csv matches the data frame we wrote to csv, as it should!
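To see why setting index to False is helpful, the sketch below writes the same data twice, once with and once without the index, and reads both files back. It uses a temporary directory and made-up file names so that it runs anywhere, rather than the ../data path used above.

```python
import os
import tempfile

import pandas as pd

original = pd.DataFrame({
    'name': ['Jill', 'Jesse', 'Jane', 'Jacob'],
    'age': [34, 67, 78, 45],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file names used only for this illustration
    without_index = os.path.join(tmp_dir, 'without_index.csv')
    with_index = os.path.join(tmp_dir, 'with_index.csv')

    original.to_csv(without_index, index=False)
    original.to_csv(with_index)  # row numbers are written as an unnamed first column

    clean = pd.read_csv(without_index)
    extra = pd.read_csv(with_index)

clean_columns = list(clean.columns)
extra_columns = list(extra.columns)

print(clean_columns)
print(extra_columns)
```

The file written with the index comes back with an extra column holding the old row numbers, which is usually not what you want.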
8. Data Frame from Excel
Finally, let’s explore how we can read data from an Excel file into a pandas data frame. Excel files are formatted differently than delimited text files (like the csv we created), even though both types of files can be easily viewed with Excel. The major differences are that Excel files can contain multiple sheets and that they are written in a different format because they can contain formulas and not just data values.
You should have followed the directions at the beginning of the article to install openpyxl. Do that now if you haven’t already. openpyxl is required to run the following code.
I’m working with an Excel file that contains two sheets, which are creatively named ‘Sheet1’ and ‘Sheet2’. Sheet1 contains the same name and age data we’ve been working with. Sheet2 contains occupations and their associated salaries (made-up data).
Let’s start out by reading Sheet1 as a pandas data frame.
fn = '../data/myexcel.xlsx'
df = pd.read_excel(fn)
df
| | Name | Age |
---|---|---|
0 | Jill | 34 |
1 | Jesse | 67 |
2 | Jane | 78 |
3 | Jacob | 45 |
As you can see from the resulting data frame, only the first sheet of the file was read. That is the default behavior of read_excel when no sheet is specified. Below is a more explicit way to write the same code so that I know which sheet I am getting back. I’ve just specified the sheet_name argument.
df = pd.read_excel(fn, sheet_name='Sheet1')
df
| | Name | Age |
---|---|---|
0 | Jill | 34 |
1 | Jesse | 67 |
2 | Jane | 78 |
3 | Jacob | 45 |
As you can see, we get the same result, but the code now states explicitly which sheet is being read.
Let’s use this same method to read the second sheet.
df = pd.read_excel(fn, sheet_name='Sheet2')
df
| | Occupation | Salary |
---|---|---|
0 | Secretary | 60000 |
1 | Doctor | 150000 |
2 | Programmer | 120000 |
3 | Attorney | 170000 |
And there is the second sheet, just as it should be. That is how you can read Excel files as a pandas data frame. Note that you can also read spreadsheets in other formats such as odf, ods, and odt. Just read the pandas documentation to be sure you have the proper supporting modules installed.
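If you want every sheet at once rather than one sheet per call, read_excel also accepts sheet_name=None, which returns a dictionary that maps each sheet name to its own data frame. The helper below is a minimal sketch; read_all_sheets is a hypothetical name chosen for this illustration, and the usage lines are commented out because they depend on the local Excel file used in this tutorial.

```python
import pandas as pd


def read_all_sheets(path):
    """Return a dict mapping each sheet name to a data frame (hypothetical helper)."""
    # sheet_name=None asks pandas to read every sheet in the workbook
    return pd.read_excel(path, sheet_name=None)


# Example usage with the workbook from this tutorial:
# sheets = read_all_sheets('../data/myexcel.xlsx')
# sheets['Sheet2']
```

Reading everything up front can be convenient for small workbooks, while reading a single named sheet keeps memory use lower for large ones.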
Conclusion
There is so much information stored in tabular data that you need programming skills to access and evaluate it properly. The pandas module for Python is one of the best tools available to interact with these data. This tutorial has demonstrated 8 different methods you can use to create pandas data frames that will help you access, store, and organize your data. Hopefully, this tutorial has helped you move along your path to becoming a verified data wizard!