Image by Author | DALLE-3 & Canva
If you’ve ever had the chance to work with data, you’ve probably come across the need to load JSON (short for JavaScript Object Notation) files into a Pandas DataFrame for further analysis. JSON files store data in a format that is easy for people to read and simple for computers to parse. However, JSON files can sometimes be complicated to navigate through. Therefore, we load them into a more structured format like a DataFrame, which is laid out like a spreadsheet with rows and columns.
I will show you two different ways to convert JSON data into a Pandas DataFrame. Before we discuss these methods, let’s consider the following dummy nested JSON file that I’ll use as an example throughout this article.
{
  "books": [
    {
      "title": "One Hundred Years of Solitude",
      "author": "Gabriel Garcia Marquez",
      "reviews": [
        {
          "reviewer": {
            "name": "Kanwal Mehreen",
            "location": "Islamabad, Pakistan"
          },
          "rating": 4.5,
          "comments": "Magical and completely breathtaking!"
        },
        {
          "reviewer": {
            "name": "Isabella Martinez",
            "location": "Bogotá, Colombia"
          },
          "rating": 4.7,
          "comments": "A marvelous journey through a world of magic."
        }
      ]
    },
    {
      "title": "Things Fall Apart",
      "author": "Chinua Achebe",
      "reviews": [
        {
          "reviewer": {
            "name": "Zara Khan",
            "location": "Lagos, Nigeria"
          },
          "rating": 4.9,
          "comments": "Things Fall Apart is the best of contemporary African literature."
        }
      ]
    }
  ]
}
The above JSON data represents a list of books, where each book has a title, an author, and a list of reviews. Each review, in turn, has a reviewer (with a name and location), along with a rating and comments.
Method 1: Using the json.load() and pd.DataFrame() functions
The easiest and most straightforward approach is to use the built-in json.load() function to parse our JSON data. This converts it into a Python dictionary, and we can then create the DataFrame directly from the resulting Python data structure. However, it has a problem: it only handles the top level of the data, so any deeper nesting ends up as raw dictionaries or lists inside the DataFrame cells. So, for the above case, if you only use these steps with this code:
import json
import pandas as pd

# Load the JSON data
with open('books.json', 'r') as f:
    data = json.load(f)

# Create a DataFrame from the JSON data
df = pd.DataFrame(data['books'])
df
Your output might look like this:
Output:
In the reviews column, you can see that the entire list of review dictionaries is kept as-is. Therefore, if you want the output to appear correctly, you have to handle the nested structure manually. This can be done as follows:
# Create a DataFrame from the nested JSON data
df = pd.DataFrame([
    {
        'title': book['title'],
        'author': book['author'],
        'reviewer_name': review['reviewer']['name'],
        'reviewer_location': review['reviewer']['location'],
        'rating': review['rating'],
        'comments': review['comments']
    }
    for book in data['books']
    for review in book['reviews']
])
Updated Output:
Here, we are using a list comprehension to create a flat list of dictionaries, where each dictionary contains the book information and one corresponding review. We then create the Pandas DataFrame from this list.
However, the issue with this approach is that it demands more manual effort to manage the nested structure of the JSON data. So, what now? Do we have any other option?
Totally! I mean, come on. Given that we’re in the 21st century, facing such a problem without a solution seems unrealistic. Let’s see the other approach.
Method 2 (Recommended): Using the json_normalize() function
The json_normalize() function from the Pandas library is a better way to manage nested JSON data. It automatically flattens the nested structure of the JSON data, creating a DataFrame from the result. Let’s take a look at the code:
import json
import pandas as pd

# Load the JSON data
with open('books.json', 'r') as f:
    data = json.load(f)

# Create the DataFrame using json_normalize()
df = pd.json_normalize(
    data=data['books'],
    meta=['title', 'author'],
    record_path='reviews',
    errors='raise'
)
df
Output:
The json_normalize() function takes the following parameters:
- data: The input data, which can be a list of dictionaries or a single dictionary. In this case, it’s the data dictionary loaded from the JSON file.
- record_path: The path in the JSON data to the records you want to normalize. In this case, it’s the ‘reviews’ key.
- meta: Additional fields from the JSON document to include in the normalized output. In this case, we’re using the ‘title’ and ‘author’ fields. Note that the metadata columns appear at the end of the resulting DataFrame; that’s simply how the function works. It doesn’t matter for the analysis, but if you want these columns to appear first, you have to reorder them manually, as shown in the snippet after this list.
- errors: The error handling strategy, which can be ‘ignore’ or ‘raise’. We have set it to ‘raise’, so if any of the keys listed in meta are missing from a record, the normalization will raise an exception.
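For instance, here is a minimal sketch of such a manual reorder, assuming the column names produced by the json_normalize() call above (where the nested reviewer fields become reviewer.name and reviewer.location):

# Move the meta columns ('title', 'author') to the front of the DataFrame
front_cols = ['title', 'author']
other_cols = [col for col in df.columns if col not in front_cols]
df = df[front_cols + other_cols]
df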
Wrapping Up
Both of these methods have their own advantages and use cases, and the choice of method depends on the structure and complexity of the JSON data. If the JSON data has a deeply nested structure, the json_normalize() function might be the most suitable option, as it can handle the nested data automatically. If the JSON data is relatively simple and flat, the pd.read_json() function, which reads a JSON file straight into a DataFrame, might be the easiest and most straightforward approach.
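As a quick sketch, suppose we had a hypothetical flat_books.json file containing a plain JSON array of flat records with no nested fields; pd.read_json() would load it in one line:

import pandas as pd

# flat_books.json is a hypothetical file holding a plain JSON array of flat
# records, e.g. [{"title": "...", "author": "...", "rating": 4.5}, ...]
df = pd.read_json('flat_books.json')
print(df.head())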
When dealing with large JSON files, it’s crucial to think about memory usage and performance since loading the whole file into memory might not work. So, you might have to look into other options like streaming the data, lazy loading, or using a more memory-efficient format like Parquet.
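For example, here is a minimal sketch assuming a hypothetical large_books.jsonl file in JSON Lines format (one record per line) with title and author fields; it reads the file in chunks and stores the result as Parquet (writing Parquet requires pyarrow or fastparquet to be installed):

import pandas as pd

# Read the JSON Lines file in chunks of 10,000 records instead of loading
# the entire file into memory at once (chunksize requires lines=True)
chunks = pd.read_json('large_books.jsonl', lines=True, chunksize=10_000)

processed = []
for chunk in chunks:
    # Keep only the columns we need to reduce memory usage
    processed.append(chunk[['title', 'author']])

df = pd.concat(processed, ignore_index=True)

# Save to Parquet, a more memory- and disk-efficient columnar format
df.to_parquet('books.parquet', index=False)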
Kanwal Mehreen
Kanwal is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.