background, learning Python for data analysis has been a bit challenging. The syntax is simpler, true. However, the concepts and terminology are completely different. In SQL, you work with databases, tables, and columns. In Python, your bread and butter for data analysis is going to be data structures.
Data structures in Python are objects that store and organize data. Python includes several built-in data structures, such as lists, tuples, sets, and dictionaries. All of these are used to store and manipulate data. Some are mutable (lists, sets, and dictionaries) and some are not (tuples). To learn more about Python data structures, I highly recommend reading the book “Python for Data Analysis” by Wes McKinney. I just started reading it, and I think it’s stellar.
In this article, I’m going to walk you through what a DataFrame is in Pandas and how to create one step by step.
Understand Array fundamentals
There’s a library in Python called NumPy; you might have heard of it. It’s mostly used for mathematical and numerical computations. One of the features it offers is the ability to create arrays. You might be wondering: what the heck is an array?
An array is similar to a list, except that all of its values share the same data type. Lists, however, can store values of different data types (int, text, boolean, etc.). Here’s an example of a list:
my_list = [1, "hello", 3.14, True]
Lists are also mutable. In other words, you can add and remove elements.
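To make the mutable versus immutable distinction concrete, here is a minimal sketch. The variable names and the values being added or replaced are only for illustration.

```python
# Lists are mutable: elements can be added, removed, or replaced in place.
my_list = [1, "hello", 3.14, True]
my_list.append("new item")   # add a value to the end
my_list.remove("hello")      # remove the first matching value
my_list[0] = 100             # replace the value at position 0
print(my_list)

# Tuples are immutable: the same kind of change raises a TypeError.
my_tuple = (1, "hello", 3.14, True)
try:
    my_tuple[0] = 100
except TypeError as err:
    tuple_error = str(err)
    print("Cannot modify a tuple:", tuple_error)
```

The list ends up changed, while the tuple is exactly what it was when it was created.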
Back to arrays. In NumPy, arrays can be multidimensional, which is why the array type is called ndarray (N-dimensional array). To get started, import the NumPy library:
import numpy as np
To create a basic array in NumPy, we use the np.array() function and pass it the values we want to store, here as a list:
arr = np.array([1, 2, 3, 4, 5])
arr
Here’s the result:
array([1, 2, 3, 4, 5])
To check what kind of object arr is:
type(arr)
We get the object’s type:
numpy.ndarray
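Note that type() reports the kind of container, not the type of the values inside it. As a small sketch of the "same data type" rule, the dtype attribute shows the element type, and a list mixing integers and floats gets converted to a single type. The name mixed is just for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
print(type(arr))    # the container's class: numpy.ndarray
print(arr.dtype)    # the element type: an integer type (exact width depends on the platform)

# Mixing ints and floats: NumPy upcasts every value to float
# so that all elements share one data type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
print(mixed)
```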
The cool thing about arrays is that you can perform mathematical calculations on every element at once. For instance:
arr * 2
The result:
array([ 2,  4,  6,  8, 10])
Pretty cool, right?
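To see why this matters, here is a short comparison with a plain list. It is only a sketch, and the variable names are mine. Multiplying a list by 2 repeats it, while multiplying an array by 2 doubles each value.

```python
import numpy as np

values = [1, 2, 3, 4, 5]
arr = np.array(values)

repeated = values * 2   # list: the sequence is repeated
doubled = arr * 2       # array: each element is multiplied
summed = arr + arr      # array: element-wise addition

print(repeated)
print(doubled)
print(summed)
print(arr.mean())       # arrays also come with built-in numerical methods
```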
Now that you know the basics of arrays in NumPy, let’s dig deeper into N-dimensional arrays.
The array you see above is a 1-dimensional (1D) array. Also known as a vector, a 1D array consists of a single sequence of values, like so: [1, 2, 3, 4, 5]
A 2-dimensional array (a matrix) stores 1D arrays as its values. Similar to rows of a table in SQL, each 1D array is like one row of data. The output is a grid of values. For instance:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
Output:
[[1 2 3]
 [4 5 6]]
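Since each inner array behaves like a row in a table, you can pull out rows, columns, and single cells by position. The following is a minimal sketch of that idea; the variable names are illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

first_row = arr[0]       # the first row, like one record in a table
second_col = arr[:, 1]   # every row, but only the column at position 1
one_value = arr[1, 2]    # the cell at row 1, column 2 (positions start at 0)

print(first_row)
print(second_col)
print(one_value)
```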
A 3-dimensional array (often called a tensor) stores 2D arrays (matrices) as its values. For instance:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
Output:
[[[1 2 3]
  [4 5 6]]

 [[1 2 3]
  [4 5 6]]]
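An array can have more than three dimensions (NumPy does enforce an upper limit, but it is far higher than you’ll typically need). The number of dimensions depends on how your data is organized, not on how much data you have.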
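If you are ever unsure how many dimensions an array has, you can ask it. This sketch rebuilds the three arrays from above under the illustrative names vector, matrix, and tensor, and prints their ndim, shape, and size attributes.

```python
import numpy as np

vector = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3], [4, 5, 6]])
tensor = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

for name, a in [("vector", vector), ("matrix", matrix), ("tensor", tensor)]:
    # ndim = number of dimensions, shape = length along each dimension,
    # size = total number of elements
    print(name, "ndim:", a.ndim, "shape:", a.shape, "size:", a.size)
```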
Creating a DataFrame from an array
Now that you’ve got the gist of arrays, let’s create a DataFrame from one.
First, we’ll have to import the pandas and NumPy libraries:
import pandas as pd
import numpy as np
Next, create the array:
data = np.array([[1, 4], [2, 5], [3, 6]])
Here, I’ve created a 2D array. A pandas DataFrame can only be built from 1D and 2D arrays. If you try to pass in a 3D array, you’ll get an error.
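Here is a small sketch of what that error looks like in practice, along with one possible workaround. The names cube, flat, and df_flat are mine, and reshaping is just one option; whether it makes sense depends on what the extra dimension means in your data.

```python
import numpy as np
import pandas as pd

# A 3D array with shape (2, 2, 3)
cube = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

try:
    pd.DataFrame(cube)
    failed = False
except ValueError as err:
    failed = True
    print("pandas refused the 3D array:", err)

# One possible workaround: merge the first two dimensions into rows,
# turning shape (2, 2, 3) into shape (4, 3).
flat = cube.reshape(-1, cube.shape[-1])
df_flat = pd.DataFrame(flat)
print(df_flat)
```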
Now that we’ve got our array, let’s pass it into a DataFrame. To create a DataFrame, use pd.DataFrame().
# creating the DataFrame
df = pd.DataFrame(data)
# showing the DataFrame
df
Output:
   0  1
0  1  4
1  2  5
2  3  6
Looking good so far. But it needs a little formatting:
# creating a dataframe
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'],
                  columns=['col1', 'col2'])
# showing the dataframe
df
Output:
      col1  col2
row1     1     4
row2     2     5
row3     3     6
Now that’s better. All I did was label the rows using the index parameter and the columns using the columns parameter.
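Those labels are not just cosmetic. As a quick sketch, once rows and columns have names you can use them to look things up. The variable names col and cell are illustrative.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 4], [2, 5], [3, 6]])
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'],
                  columns=['col1', 'col2'])

print(df.index.tolist())       # the row labels
print(df.columns.tolist())     # the column labels

col = df['col2']               # select a whole column by its label
cell = df.loc['row2', 'col1']  # select one cell by row label and column label
print(col)
print(cell)
```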
And there you go, you have your DataFrame. It’s that simple. Let’s explore some more handy ways to create a DataFrame.
Creating a DataFrame from a dictionary
One of the built-in data structures Python offers is the dictionary. Basically, dictionaries store key-value pairs, where every key must be unique and hashable (in practice, an immutable value such as a string or a number). A dictionary is written with curly brackets {}. Here’s an example:
person = {"name": "John", "age": 30}
Here, the keys are name and age, and the values are John and 30. Simple as that. Now, let’s create a DataFrame from a dictionary.
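Before moving on, here is a minimal sketch of how keys behave. The variable name person, the city key, and its value are made up for illustration.

```python
person = {"name": "John", "age": 30}

print(person["name"])      # look up a value by its key

person["city"] = "Paris"   # a new key adds a new pair (hypothetical value)
person["age"] = 31         # an existing key is overwritten, so keys stay unique
print(list(person.keys()))

# Keys must be hashable. A list is mutable, so it cannot be used as a key.
try:
    bad = {["a", "b"]: 1}
except TypeError as err:
    key_error_message = str(err)
    print("Invalid key:", key_error_message)
```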
First, I created two lists to store the names and ages:
names = ["John", "David", "Jane", "Mary"]
age = [30, 27, 35, 23]
Next, I stored both lists in a dictionary, using Names and Age as the keys:
dict_names = {'Names': names, 'Age': age}
Finally, I passed the dictionary to pd.DataFrame():
# Creating the dataframe
df_names = pd.DataFrame(dict_names)
df_names
Above, we have our DataFrame built from the dictionary we created. Each key becomes a column name, and each list becomes that column’s values. Here’s the output:
   Names  Age
0   John   30
1  David   27
2   Jane   35
3   Mary   23
And there we go, we have a DataFrame created from a dictionary.
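One thing to watch out for with this approach is that every list in the dictionary needs the same length, because each list becomes a column. This sketch deliberately leaves one age out to show the error, then builds the DataFrame correctly. The variable length_error is only for illustration.

```python
import pandas as pd

names = ["John", "David", "Jane", "Mary"]
short_age = [30, 27, 35]   # one value short, on purpose

try:
    pd.DataFrame({'Names': names, 'Age': short_age})
    length_error = None
except ValueError as err:
    length_error = str(err)
    print("Could not build the DataFrame:", length_error)

# With lists of equal length, it works as expected.
df_names = pd.DataFrame({'Names': names, 'Age': [30, 27, 35, 23]})
print(df_names.shape)    # (rows, columns)
print(df_names.dtypes)   # the data type pandas chose for each column
```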
Creating a DataFrame from a CSV file
This is probably the method you’ll be using the most. It’s common practice to read CSV files into pandas when doing data analysis, similar to how you open spreadsheets in Excel or import data into SQL. In pandas, you read CSVs by using the read_csv() function. Here’s an example:
# reading the csv file
df_exams = pd.read_csv('StudentsPerformance.csv')
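In some cases, you’ll have to copy the full file path and paste it in. On Windows, put an r in front of the string so the backslashes in the path are read literally:
pd.read_csv(r"C:\data\suppliers lists - Sheet1.csv")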
And there you go!
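If you don’t have a CSV file handy, you can still try read_csv() out. The sketch below feeds it CSV text from memory through io.StringIO instead of a file on disk. The column names, values, and the example path are all made up for illustration and are not taken from the files mentioned above.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk.
csv_text = """student,math_score,reading_score
A,72,80
B,65,58
C,90,95
"""

# read_csv() accepts a file path or a file-like object such as StringIO.
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo.head())   # the first rows of the DataFrame
print(df_demo.shape)    # (rows, columns)

# A hypothetical Windows path written as a raw string,
# so the backslashes are kept exactly as typed.
windows_path = r"C:\data\example.csv"
print(windows_path)
```

The first line of the CSV is used as the column names by default, which is why the DataFrame has three rows rather than four.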
Wrapping up
Creating DataFrames in pandas might seem complex, but it actually isn’t. In most cases, you’ll probably be reading CSV files anyway. So don’t sweat it. I hope you found this article helpful. Would love to hear your thoughts in the comments. Thanks for reading!