Extract data from JSON in Pandas Dataframe
Data stored in a csv file can often contain JSON objects in one of the fields. The csv data can easily be loaded into a Pandas Dataframe for analysis. Fields of interest in the JSON object can also be extracted into their own column easily using Dataframe.Apply.
Convert JSON data to Dict
JSON data is a name/value pair structure used to transfer and store data. It maps easily to the Python Dictionary structure. However JSON data in a string cannot be used directly in the Dictionary constructor. This is easily overcome by using the built-in json module to load the string as a Python Dictionary.
1json_string = '{"color": "Red", "wavelength": "620 to 750 nm"}'
2error_dict = dict(json_string)
3
4>>> # ValueError: dictionary update sequence element #0 has length 1; 2 is required
1json_string = '{"color": "Red", "wavelength": "620 to 750 nm"}'
2json_dict = json.loads(json_string)
3
4print(json.dumps(json_dict, indent=4))
5# {
6# "color": "Red",
7# "wavelength": "620 to 750 nm"
8# }
Once it is loaded into a Dictionary, elements can be accessed by the keys.
1json_dict.keys()
2# dict_keys(['color', 'wavelength'])
3
4
5json_dict['color']
6# Red
Load csv data with some JSON into Dataframe
Here is sample data where the second column contains JSON objects.
1df1 = pd.dataframe(
2 [
3 ["a", '{"color": "Red", "wavelength": "620 to 750 nm"}', "11"],
4 ["b", '{"color": "Orange", "wavelength": "590 to 620 nm"}', "12"],
5 ["c", '{"color": "Yellow", "wavelength": "570 to 590 nm"}', "13"],
6 ["d", '{"color": "Green", "wavelength": "495 to 570 nm"}', "14"],
7 ["e", '{"color": "Blue", "wavelength": "450 to 495 nm"}', "15"],
8 ["f", '{"color": "Indigo", "wavelength": "425 to 450 nm"}', "16"],
9 ["g", '{"color": "Violet", "wavelength": "380 to 450 nm"}', "17"],
10 ],
11 index=["row 1", "row 2", "row 3", "row 4", "row 5", "row 6", "row 7"],
12 columns=["one", "two", "three"],
13)
Dataframe containing JSON objects in a column
one | two | three | |
---|---|---|---|
row 1 | a | { "color": "Red", "wavelength": "620 to 750 nm" } |
11 |
row 2 | b | { "color": "Orange", "wavelength": "590 to 620 nm" } |
12 |
row 3 | c | { "color": "Yellow", "wavelength": "570 to 590 nm" } |
13 |
row 4 | d | { "color": "Green", "wavelength": "495 to 570 nm" } |
14 |
row 5 | e | { "color": "Blue", "wavelength": "450 to 495 nm" } |
15 |
row 6 | f | { "color": "Indigo", "wavelength": "425 to 450 nm" } |
16 |
row 7 | g | { "color": "Violet", "wavelength": "380 to 450 nm" } |
17 |
Extract an element from the JSON object to a new column
The Dataframe apply method is used to apply a function to all data along an axis of the Dataframe. Apply is used to load the json object in each cell as a dictionary and extract the color property, which is appended to the end of the Dataframe as a new column. The apply method can be amazingly fast even for large datasets.
1df1['color'] = df1.apply(lambda x: json.loads(x['two'])['color'], axis = 1)
Dataframe with Color extracted from the JSON data
one | two | three | color | |
---|---|---|---|---|
row 1 | a | { "color": "Red", "wavelength": "620 to 750 nm" } |
11 | Red |
row 2 | b | { "color": "Orange", "wavelength": "590 to 620 nm" } |
12 | Orange |
row 3 | c | { "color": "Yellow", "wavelength": "570 to 590 nm" } |
13 | Yellow |
row 4 | d | { "color": "Green", "wavelength": "495 to 570 nm" } |
14 | Green |
row 5 | e | { "color": "Blue", "wavelength": "450 to 495 nm" } |
15 | Blue |
row 6 | f | { "color": "Indigo", "wavelength": "425 to 450 nm" } |
16 | Indigo |
row 7 | g | { "color": "Violet", "wavelength": "380 to 450 nm" } |
17 | Violet |
Extract all elements from JSON at once
Here are a number of ways to extract all the elements from json objects at once and
append the data as columns to the Dataframe. The first loads the JSON data twice once
for values and once for keys, this could be improved by defining a function to load
the json and return a pandas series. This is not necessary as the Apply method has a
result_type
parameter that can be set to expand, which will expand list-like
results to columns in the Dataframe.
1df1.apply(lambda x: pd.Series(
2 json.loads(x['two']).values(),
3 index=json.loads(x['two']).keys()),
4 axis=1)
1df1.apply(lambda x: pd.Series(json.loads(x['two'])), axis=1, result_type='expand')
1df1.two.apply(lambda x: pd.Series(json.loads(x)))
All JSON fields extracted to columns
color | wavelength | |
---|---|---|
row 1 | Red | 620 to 750 nm |
row 2 | Orange | 590 to 620 nm |
row 3 | Yellow | 570 to 590 nm |
row 4 | Green | 495 to 570 nm |
row 5 | Blue | 450 to 495 nm |
row 6 | Indigo | 425 to 450 nm |
row 7 | Violet | 380 to 450 nm |
1df1.join(
2 df1['two'].apply(lambda x: pd.Series(json.loads(x)))
3)
All JSON fields extracted to columns and joined to original Dataframe
one | two | three | color | wavelength | |
---|---|---|---|---|---|
row 1 | a | { "color": "Red", "wavelength": "620 to 750 nm" } |
11 | Red | 620 to 750 nm |
row 2 | b | { "color": "Orange", "wavelength": "590 to 620 nm" } |
12 | Orange | 590 to 620 nm |
row 3 | c | { "color": "Yellow", "wavelength": "570 to 590 nm" } |
13 | Yellow | 570 to 590 nm |
row 4 | d | { "color": "Green", "wavelength": "495 to 570 nm" } |
14 | Green | 495 to 570 nm |
row 5 | e | { "color": "Blue", "wavelength": "450 to 495 nm" } |
15 | Blue | 450 to 495 nm |
row 6 | f | { "color": "Indigo", "wavelength": "425 to 450 nm" } |
16 | Indigo | 425 to 450 nm |
row 7 | g | { "color": "Violet", "wavelength": "380 to 450 nm" } |
17 | Violet | 380 to 450 nm |
Extract elements from different JSON objects
One of the advantages of using JSON to store and transfer data is the flexibility of the JSON storage. Each JSON object is an independent collection of name/value pairs and so a collection of JSON objects may contain different elements. The above code works for all the elements in the series of JSON objects can be expanded into separate columns, elements that are not present in a JSON object will automatically be filled in with NaN.
1df1 = pd.dataframe(
2 [
3 ["a", '{"color": "Red", "wavelength": "620 to 750 nm", "RGB": "(255,0,0)", "CMYK": "(0.00, 1.00, 1.00, 0.00)"}', "11"],
4 ["b", '{"color": "Orange", "wavelength": "590 to 620 nm", "CMYK": "(0.00, 0.35, 1.00, 0.00)"}', "12"],
5 ["c", '{"color": "Yellow", "wavelength": "570 to 590 nm", "RGB": "(255,255,0)", "HSL": "(0.17, 1.00, 0.50)"}', "13"],
6 ["d", '{"color": "Green", "wavelength": "495 to 570 nm", "RGB": "(0,250,0)", "HSL": "(0.33, 1.00, 0.50)"}', "14"],
7 ["e", '{"color": "Blue", "wavelength": "450 to 495 nm", "RGB": "(0,0,255)", "CMYK": "(1.00, 1.00, 0.00, 0.00)"}', "15"],
8 ["f", '{"color": "Indigo", "wavelength": "425 to 450 nm", "RGB": "(75,0,130)", "CMYK": "(0.42, 1.00, 0.00, 0.49)"}', "16"],
9 ["g", '{"color": "Violet", "wavelength": "380 to 450 nm", "RGB": "(238,130,238)", "CMYK": "(0.00, 0.45, 0.00, 0.07)", "HSL": "(0.83, 0.76, 0.72)"}', "17"],
10 ],
11 index=["row 1", "row 2", "row 3", "row 4", "row 5", "row 6", "row 7"],
12 columns=["one", "two", "three"],
13)
Dataframe with inconsistent JSON objects in a column
one | two | three | |
---|---|---|---|
row 1 | a | { "color": "Red", "wavelength": "620 to 750 nm", "RGB": "(255,0,0)", "CMYK": "(0.00, 1.00, 1.00, 0.00)" } |
11 |
row 2 | b | { "color": "Orange", "wavelength": "590 to 620 nm", "CMYK": "(0.00, 0.35, 1.00, 0.00)" } |
12 |
row 3 | c | { "color": "Yellow", "wavelength": "570 to 590 nm", "RGB": "(255,255,0)", "HSL": "(0.17, 1.00, 0.50)" } |
13 |
row 4 | d | { "color": "Green", "wavelength": "495 to 570 nm", "RGB": "(0,250,0)", "HSL": "(0.33, 1.00, 0.50)" } |
14 |
row 5 | e | { "color": "Blue", "wavelength": "450 to 495 nm", "RGB": "(0,0,255)", "CMYK": "(1.00, 1.00, 0.00, 0.00)" } |
15 |
row 6 | f | { "color": "Indigo", "wavelength": "425 to 450 nm", "RGB": "(75,0,130)", "CMYK": "(0.42, 1.00, 0.00, 0.49)" } |
16 |
row 7 | g | { "color": "Violet", "wavelength": "380 to 450 nm", "RGB": "(238,130,238)", "CMYK": "(0.00, 0.45, 0.00, 0.07)", "HSL": "(0.83, 0.76, 0.72)" } |
17 |
1df1.join(
2 df1['two'].apply(lambda x: pd.Series(json.loads(x)))
3)
All JSON fields extracted to columns and joined to original Dataframe
one | two | three | color | wavelength | RGB | CMYK | HSL | |
---|---|---|---|---|---|---|---|---|
row 1 | a | { "color": "Red", "wavelength": "620 to 750 nm", "RGB": "(255,0,0)", "CMYK": "(0.00, 1.00, 1.00, 0.00)" } |
11 | Red | 620 to 750 nm | (255,0,0) | (0.00, 1.00, 1.00, 0.00) | nan |
row 2 | b | { "color": "Orange", "wavelength": "590 to 620 nm", "CMYK": "(0.00, 0.35, 1.00, 0.00)" } |
12 | Orange | 590 to 620 nm | nan | (0.00, 0.35, 1.00, 0.00) | nan |
row 3 | c | { "color": "Yellow", "wavelength": "570 to 590 nm", "RGB": "(255,255,0)", "HSL": "(0.17, 1.00, 0.50)" } |
13 | Yellow | 570 to 590 nm | (255,255,0) | nan | (0.17, 1.00, 0.50) |
row 4 | d | { "color": "Green", "wavelength": "495 to 570 nm", "RGB": "(0,250,0)", "HSL": "(0.33, 1.00, 0.50)" } |
14 | Green | 495 to 570 nm | (0,250,0) | nan | (0.33, 1.00, 0.50) |
row 5 | e | { "color": "Blue", "wavelength": "450 to 495 nm", "RGB": "(0,0,255)", "CMYK": "(1.00, 1.00, 0.00, 0.00)" } |
15 | Blue | 450 to 495 nm | (0,0,255) | (1.00, 1.00, 0.00, 0.00) | nan |
row 6 | f | { "color": "Indigo", "wavelength": "425 to 450 nm", "RGB": "(75,0,130)", "CMYK": "(0.42, 1.00, 0.00, 0.49)" } |
16 | Indigo | 425 to 450 nm | (75,0,130) | (0.42, 1.00, 0.00, 0.49) | nan |
row 7 | g | { "color": "Violet", "wavelength": "380 to 450 nm", "RGB": "(238,130,238)", "CMYK": "(0.00, 0.45, 0.00, 0.07)", "HSL": "(0.83, 0.76, 0.72)" } |
17 | Violet | 380 to 450 nm | (238,130,238) | (0.00, 0.45, 0.00, 0.07) | (0.83, 0.76, 0.72) |
1df1.join(
2 df1['two'].apply(lambda x: pd.Series(json.loads(x)))
3).drop(['two'], axis=1)
All JSON fields extracted to columns with original JSON column removed
one | three | color | wavelength | RGB | CMYK | HSL | |
---|---|---|---|---|---|---|---|
row 1 | a | 11 | Red | 620 to 750 nm | (255,0,0) | (0.00, 1.00, 1.00, 0.00) | nan |
row 2 | b | 12 | Orange | 590 to 620 nm | nan | (0.00, 0.35, 1.00, 0.00) | nan |
row 3 | c | 13 | Yellow | 570 to 590 nm | (255,255,0) | nan | (0.17, 1.00, 0.50) |
row 4 | d | 14 | Green | 495 to 570 nm | (0,250,0) | nan | (0.33, 1.00, 0.50) |
row 5 | e | 15 | Blue | 450 to 495 nm | (0,0,255) | (1.00, 1.00, 0.00, 0.00) | nan |
row 6 | f | 16 | Indigo | 425 to 450 nm | (75,0,130) | (0.42, 1.00, 0.00, 0.49) | nan |
row 7 | g | 17 | Violet | 380 to 450 nm | (238,130,238) | (0.00, 0.45, 0.00, 0.07) | (0.83, 0.76, 0.72) |