Extract data from JSON in Pandas Dataframe

Jul 24, 2022 · 8 min read · pandas dataframe json python ·

Data stored in a csv file can often contain JSON objects in one of the fields. The csv data can easily be loaded into a Pandas Dataframe for analysis. Fields of interest in the JSON object can also be extracted into their own column easily using Dataframe.Apply.

Convert JSON data to Dict

JSON data is a name/value pair structure used to transfer and store data. It maps easily to the Python Dictionary structure. However JSON data in a string cannot be used directly in the Dictionary constructor. This is easily overcome by using the built-in json module to load the string as a Python Dictionary.

1json_string = '{"color": "Red", "wavelength": "620 to 750 nm"}'
2error_dict = dict(json_string)
3
4>>> # ValueError: dictionary update sequence element #0 has length 1; 2 is required

1json_string = '{"color": "Red", "wavelength": "620 to 750 nm"}'
2json_dict = json.loads(json_string)
3
4print(json.dumps(json_dict, indent=4))
5# {
6#     "color": "Red",
7#     "wavelength": "620 to 750 nm"
8# }

Once it is loaded into a Dictionary, elements can be accessed by the keys.

1json_dict.keys()
2# dict_keys(['color', 'wavelength'])
3
4
5json_dict['color']
6# Red

Load csv data with some JSON into Dataframe

Here is sample data where the second column contains JSON objects.

 1df1 = pd.dataframe(
 2    [
 3        ["a", '{"color": "Red",    "wavelength": "620 to 750 nm"}', "11"],
 4        ["b", '{"color": "Orange", "wavelength": "590 to 620 nm"}', "12"],
 5        ["c", '{"color": "Yellow", "wavelength": "570 to 590 nm"}', "13"],
 6        ["d", '{"color": "Green",  "wavelength": "495 to 570 nm"}', "14"],
 7        ["e", '{"color": "Blue",   "wavelength": "450 to 495 nm"}', "15"],
 8        ["f", '{"color": "Indigo", "wavelength": "425 to 450 nm"}', "16"],
 9        ["g", '{"color": "Violet", "wavelength": "380 to 450 nm"}', "17"],
10    ],
11    index=["row 1", "row 2", "row 3", "row 4", "row 5", "row 6", "row 7"],
12    columns=["one", "two", "three"],
13)

Dataframe containing JSON objects in a column

	one	two	three
row 1	a	{ "color": "Red", "wavelength": "620 to 750 nm" }	11
row 2	b	{ "color": "Orange", "wavelength": "590 to 620 nm" }	12
row 3	c	{ "color": "Yellow", "wavelength": "570 to 590 nm" }	13
row 4	d	{ "color": "Green", "wavelength": "495 to 570 nm" }	14
row 5	e	{ "color": "Blue", "wavelength": "450 to 495 nm" }	15
row 6	f	{ "color": "Indigo", "wavelength": "425 to 450 nm" }	16
row 7	g	{ "color": "Violet", "wavelength": "380 to 450 nm" }	17

Extract an element from the JSON object to a new column

The Dataframe apply method is used to apply a function to all data along an axis of the Dataframe. Apply is used to load the json object in each cell as a dictionary and extract the color property, which is appended to the end of the Dataframe as a new column. The apply method can be amazingly fast even for large datasets.

1df1['color'] = df1.apply(lambda x: json.loads(x['two'])['color'], axis = 1)

Dataframe with Color extracted from the JSON data

	one	two	three	color
row 1	a	{ "color": "Red", "wavelength": "620 to 750 nm" }	11	Red
row 2	b	{ "color": "Orange", "wavelength": "590 to 620 nm" }	12	Orange
row 3	c	{ "color": "Yellow", "wavelength": "570 to 590 nm" }	13	Yellow
row 4	d	{ "color": "Green", "wavelength": "495 to 570 nm" }	14	Green
row 5	e	{ "color": "Blue", "wavelength": "450 to 495 nm" }	15	Blue
row 6	f	{ "color": "Indigo", "wavelength": "425 to 450 nm" }	16	Indigo
row 7	g	{ "color": "Violet", "wavelength": "380 to 450 nm" }	17	Violet

Extract all elements from JSON at once

Here are a number of ways to extract all the elements from json objects at once and append the data as columns to the Dataframe. The first loads the JSON data twice once for values and once for keys, this could be improved by defining a function to load the json and return a pandas series. This is not necessary as the Apply method has a result_type parameter that can be set to expand, which will expand list-like results to columns in the Dataframe.

1df1.apply(lambda x: pd.Series(
2    json.loads(x['two']).values(),
3    index=json.loads(x['two']).keys()), 
4    axis=1)

1df1.apply(lambda x: pd.Series(json.loads(x['two'])), axis=1, result_type='expand')

1df1.two.apply(lambda x: pd.Series(json.loads(x)))

All JSON fields extracted to columns

	color	wavelength
row 1	Red	620 to 750 nm
row 2	Orange	590 to 620 nm
row 3	Yellow	570 to 590 nm
row 4	Green	495 to 570 nm
row 5	Blue	450 to 495 nm
row 6	Indigo	425 to 450 nm
row 7	Violet	380 to 450 nm

1df1.join(
2    df1['two'].apply(lambda x: pd.Series(json.loads(x)))
3)

All JSON fields extracted to columns and joined to original Dataframe

	one	two	three	color	wavelength
row 1	a	{ "color": "Red", "wavelength": "620 to 750 nm" }	11	Red	620 to 750 nm
row 2	b	{ "color": "Orange", "wavelength": "590 to 620 nm" }	12	Orange	590 to 620 nm
row 3	c	{ "color": "Yellow", "wavelength": "570 to 590 nm" }	13	Yellow	570 to 590 nm
row 4	d	{ "color": "Green", "wavelength": "495 to 570 nm" }	14	Green	495 to 570 nm
row 5	e	{ "color": "Blue", "wavelength": "450 to 495 nm" }	15	Blue	450 to 495 nm
row 6	f	{ "color": "Indigo", "wavelength": "425 to 450 nm" }	16	Indigo	425 to 450 nm
row 7	g	{ "color": "Violet", "wavelength": "380 to 450 nm" }	17	Violet	380 to 450 nm

Extract elements from different JSON objects

One of the advantages of using JSON to store and transfer data is the flexibility of the JSON storage. Each JSON object is an independent collection of name/value pairs and so a collection of JSON objects may contain different elements. The above code works for all the elements in the series of JSON objects can be expanded into separate columns, elements that are not present in a JSON object will automatically be filled in with NaN.

 1df1 = pd.dataframe(
 2    [
 3        ["a", '{"color": "Red",    "wavelength": "620 to 750 nm", "RGB": "(255,0,0)", "CMYK": "(0.00, 1.00, 1.00, 0.00)"}', "11"],
 4        ["b", '{"color": "Orange", "wavelength": "590 to 620 nm", "CMYK": "(0.00, 0.35, 1.00, 0.00)"}', "12"],
 5        ["c", '{"color": "Yellow", "wavelength": "570 to 590 nm", "RGB": "(255,255,0)", "HSL": "(0.17, 1.00, 0.50)"}', "13"],
 6        ["d", '{"color": "Green",  "wavelength": "495 to 570 nm", "RGB": "(0,250,0)", "HSL": "(0.33, 1.00, 0.50)"}', "14"],
 7        ["e", '{"color": "Blue",   "wavelength": "450 to 495 nm", "RGB": "(0,0,255)", "CMYK": "(1.00, 1.00, 0.00, 0.00)"}', "15"],
 8        ["f", '{"color": "Indigo", "wavelength": "425 to 450 nm", "RGB": "(75,0,130)", "CMYK": "(0.42, 1.00, 0.00, 0.49)"}', "16"],
 9        ["g", '{"color": "Violet", "wavelength": "380 to 450 nm", "RGB": "(238,130,238)", "CMYK": "(0.00, 0.45, 0.00, 0.07)", "HSL": "(0.83, 0.76, 0.72)"}', "17"],
10    ],
11    index=["row 1", "row 2", "row 3", "row 4", "row 5", "row 6", "row 7"],
12    columns=["one", "two", "three"],
13)

Dataframe with inconsistent JSON objects in a column

	one	two	three
row 1	a	{ "color": "Red", "wavelength": "620 to 750 nm", "RGB": "(255,0,0)", "CMYK": "(0.00, 1.00, 1.00, 0.00)" }	11
row 2	b	{ "color": "Orange", "wavelength": "590 to 620 nm", "CMYK": "(0.00, 0.35, 1.00, 0.00)" }	12
row 3	c	{ "color": "Yellow", "wavelength": "570 to 590 nm", "RGB": "(255,255,0)", "HSL": "(0.17, 1.00, 0.50)" }	13
row 4	d	{ "color": "Green", "wavelength": "495 to 570 nm", "RGB": "(0,250,0)", "HSL": "(0.33, 1.00, 0.50)" }	14
row 5	e	{ "color": "Blue", "wavelength": "450 to 495 nm", "RGB": "(0,0,255)", "CMYK": "(1.00, 1.00, 0.00, 0.00)" }	15
row 6	f	{ "color": "Indigo", "wavelength": "425 to 450 nm", "RGB": "(75,0,130)", "CMYK": "(0.42, 1.00, 0.00, 0.49)" }	16
row 7	g	{ "color": "Violet", "wavelength": "380 to 450 nm", "RGB": "(238,130,238)", "CMYK": "(0.00, 0.45, 0.00, 0.07)", "HSL": "(0.83, 0.76, 0.72)" }	17

1df1.join(
2    df1['two'].apply(lambda x: pd.Series(json.loads(x)))
3)

All JSON fields extracted to columns and joined to original Dataframe

	one	two	three	color	wavelength	RGB	CMYK	HSL
row 1	a	{ "color": "Red", "wavelength": "620 to 750 nm", "RGB": "(255,0,0)", "CMYK": "(0.00, 1.00, 1.00, 0.00)" }	11	Red	620 to 750 nm	(255,0,0)	(0.00, 1.00, 1.00, 0.00)	nan
row 2	b	{ "color": "Orange", "wavelength": "590 to 620 nm", "CMYK": "(0.00, 0.35, 1.00, 0.00)" }	12	Orange	590 to 620 nm	nan	(0.00, 0.35, 1.00, 0.00)	nan
row 3	c	{ "color": "Yellow", "wavelength": "570 to 590 nm", "RGB": "(255,255,0)", "HSL": "(0.17, 1.00, 0.50)" }	13	Yellow	570 to 590 nm	(255,255,0)	nan	(0.17, 1.00, 0.50)
row 4	d	{ "color": "Green", "wavelength": "495 to 570 nm", "RGB": "(0,250,0)", "HSL": "(0.33, 1.00, 0.50)" }	14	Green	495 to 570 nm	(0,250,0)	nan	(0.33, 1.00, 0.50)
row 5	e	{ "color": "Blue", "wavelength": "450 to 495 nm", "RGB": "(0,0,255)", "CMYK": "(1.00, 1.00, 0.00, 0.00)" }	15	Blue	450 to 495 nm	(0,0,255)	(1.00, 1.00, 0.00, 0.00)	nan
row 6	f	{ "color": "Indigo", "wavelength": "425 to 450 nm", "RGB": "(75,0,130)", "CMYK": "(0.42, 1.00, 0.00, 0.49)" }	16	Indigo	425 to 450 nm	(75,0,130)	(0.42, 1.00, 0.00, 0.49)	nan
row 7	g	{ "color": "Violet", "wavelength": "380 to 450 nm", "RGB": "(238,130,238)", "CMYK": "(0.00, 0.45, 0.00, 0.07)", "HSL": "(0.83, 0.76, 0.72)" }	17	Violet	380 to 450 nm	(238,130,238)	(0.00, 0.45, 0.00, 0.07)	(0.83, 0.76, 0.72)

1df1.join(
2    df1['two'].apply(lambda x: pd.Series(json.loads(x)))
3).drop(['two'], axis=1)

All JSON fields extracted to columns with original JSON column removed

	one	three	color	wavelength	RGB	CMYK	HSL
row 1	a	11	Red	620 to 750 nm	(255,0,0)	(0.00, 1.00, 1.00, 0.00)	nan
row 2	b	12	Orange	590 to 620 nm	nan	(0.00, 0.35, 1.00, 0.00)	nan
row 3	c	13	Yellow	570 to 590 nm	(255,255,0)	nan	(0.17, 1.00, 0.50)
row 4	d	14	Green	495 to 570 nm	(0,250,0)	nan	(0.33, 1.00, 0.50)
row 5	e	15	Blue	450 to 495 nm	(0,0,255)	(1.00, 1.00, 0.00, 0.00)	nan
row 6	f	16	Indigo	425 to 450 nm	(75,0,130)	(0.42, 1.00, 0.00, 0.49)	nan
row 7	g	17	Violet	380 to 450 nm	(238,130,238)	(0.00, 0.45, 0.00, 0.07)	(0.83, 0.76, 0.72)