Extract data from JSON in Pandas Dataframe

Extract data from JSON in Pandas Dataframe

Data stored in a csv file can often contain JSON objects in one of the fields. The csv data can easily be loaded into a Pandas Dataframe for analysis. Fields of interest in the JSON object can also be extracted into their own column easily using Dataframe.Apply.



Convert JSON data to Dict

JSON data is a name/value pair structure used to transfer and store data. It maps easily to the Python Dictionary structure. However JSON data in a string cannot be used directly in the Dictionary constructor. This is easily overcome by using the built-in json module to load the string as a Python Dictionary.

1json_string = '{"color": "Red", "wavelength": "620 to 750 nm"}'
2error_dict = dict(json_string)
3
4>>> # ValueError: dictionary update sequence element #0 has length 1; 2 is required
1json_string = '{"color": "Red", "wavelength": "620 to 750 nm"}'
2json_dict = json.loads(json_string)
3
4print(json.dumps(json_dict, indent=4))
5# {
6#     "color": "Red",
7#     "wavelength": "620 to 750 nm"
8# }

Once it is loaded into a Dictionary, elements can be accessed by the keys.

1json_dict.keys()
2# dict_keys(['color', 'wavelength'])
3
4
5json_dict['color']
6# Red


Load csv data with some JSON into Dataframe

Here is sample data where the second column contains JSON objects.

 1df1 = pd.dataframe(
 2    [
 3        ["a", '{"color": "Red",    "wavelength": "620 to 750 nm"}', "11"],
 4        ["b", '{"color": "Orange", "wavelength": "590 to 620 nm"}', "12"],
 5        ["c", '{"color": "Yellow", "wavelength": "570 to 590 nm"}', "13"],
 6        ["d", '{"color": "Green",  "wavelength": "495 to 570 nm"}', "14"],
 7        ["e", '{"color": "Blue",   "wavelength": "450 to 495 nm"}', "15"],
 8        ["f", '{"color": "Indigo", "wavelength": "425 to 450 nm"}', "16"],
 9        ["g", '{"color": "Violet", "wavelength": "380 to 450 nm"}', "17"],
10    ],
11    index=["row 1", "row 2", "row 3", "row 4", "row 5", "row 6", "row 7"],
12    columns=["one", "two", "three"],
13)

Dataframe containing JSON objects in a column

one two three
row 1 a {
    "color": "Red",
    "wavelength": "620 to 750 nm"
}
11
row 2 b {
    "color": "Orange",
    "wavelength": "590 to 620 nm"
}
12
row 3 c {
    "color": "Yellow",
    "wavelength": "570 to 590 nm"
}
13
row 4 d {
    "color": "Green",
    "wavelength": "495 to 570 nm"
}
14
row 5 e {
    "color": "Blue",
    "wavelength": "450 to 495 nm"
}
15
row 6 f {
    "color": "Indigo",
    "wavelength": "425 to 450 nm"
}
16
row 7 g {
    "color": "Violet",
    "wavelength": "380 to 450 nm"
}
17


Extract an element from the JSON object to a new column

The Dataframe apply method is used to apply a function to all data along an axis of the Dataframe. Apply is used to load the json object in each cell as a dictionary and extract the color property, which is appended to the end of the Dataframe as a new column. The apply method can be amazingly fast even for large datasets.

1df1['color'] = df1.apply(lambda x: json.loads(x['two'])['color'], axis = 1)

Dataframe with Color extracted from the JSON data

one two three color
row 1 a {
    "color": "Red",
    "wavelength": "620 to 750 nm"
}
11 Red
row 2 b {
    "color": "Orange",
    "wavelength": "590 to 620 nm"
}
12 Orange
row 3 c {
    "color": "Yellow",
    "wavelength": "570 to 590 nm"
}
13 Yellow
row 4 d {
    "color": "Green",
    "wavelength": "495 to 570 nm"
}
14 Green
row 5 e {
    "color": "Blue",
    "wavelength": "450 to 495 nm"
}
15 Blue
row 6 f {
    "color": "Indigo",
    "wavelength": "425 to 450 nm"
}
16 Indigo
row 7 g {
    "color": "Violet",
    "wavelength": "380 to 450 nm"
}
17 Violet


Extract all elements from JSON at once

Here are a number of ways to extract all the elements from json objects at once and append the data as columns to the Dataframe. The first loads the JSON data twice once for values and once for keys, this could be improved by defining a function to load the json and return a pandas series. This is not necessary as the Apply method has a result_type parameter that can be set to expand, which will expand list-like results to columns in the Dataframe.

1df1.apply(lambda x: pd.Series(
2    json.loads(x['two']).values(),
3    index=json.loads(x['two']).keys()), 
4    axis=1)
1df1.apply(lambda x: pd.Series(json.loads(x['two'])), axis=1, result_type='expand')
1df1.two.apply(lambda x: pd.Series(json.loads(x)))

All JSON fields extracted to columns

color wavelength
row 1 Red 620 to 750 nm
row 2 Orange 590 to 620 nm
row 3 Yellow 570 to 590 nm
row 4 Green 495 to 570 nm
row 5 Blue 450 to 495 nm
row 6 Indigo 425 to 450 nm
row 7 Violet 380 to 450 nm
1df1.join(
2    df1['two'].apply(lambda x: pd.Series(json.loads(x)))
3)

All JSON fields extracted to columns and joined to original Dataframe

one two three color wavelength
row 1 a {
    "color": "Red",
    "wavelength": "620 to 750 nm"
}
11 Red 620 to 750 nm
row 2 b {
    "color": "Orange",
    "wavelength": "590 to 620 nm"
}
12 Orange 590 to 620 nm
row 3 c {
    "color": "Yellow",
    "wavelength": "570 to 590 nm"
}
13 Yellow 570 to 590 nm
row 4 d {
    "color": "Green",
    "wavelength": "495 to 570 nm"
}
14 Green 495 to 570 nm
row 5 e {
    "color": "Blue",
    "wavelength": "450 to 495 nm"
}
15 Blue 450 to 495 nm
row 6 f {
    "color": "Indigo",
    "wavelength": "425 to 450 nm"
}
16 Indigo 425 to 450 nm
row 7 g {
    "color": "Violet",
    "wavelength": "380 to 450 nm"
}
17 Violet 380 to 450 nm


Extract elements from different JSON objects

One of the advantages of using JSON to store and transfer data is the flexibility of the JSON storage. Each JSON object is an independent collection of name/value pairs and so a collection of JSON objects may contain different elements. The above code works for all the elements in the series of JSON objects can be expanded into separate columns, elements that are not present in a JSON object will automatically be filled in with NaN.

 1df1 = pd.dataframe(
 2    [
 3        ["a", '{"color": "Red",    "wavelength": "620 to 750 nm", "RGB": "(255,0,0)", "CMYK": "(0.00, 1.00, 1.00, 0.00)"}', "11"],
 4        ["b", '{"color": "Orange", "wavelength": "590 to 620 nm", "CMYK": "(0.00, 0.35, 1.00, 0.00)"}', "12"],
 5        ["c", '{"color": "Yellow", "wavelength": "570 to 590 nm", "RGB": "(255,255,0)", "HSL": "(0.17, 1.00, 0.50)"}', "13"],
 6        ["d", '{"color": "Green",  "wavelength": "495 to 570 nm", "RGB": "(0,250,0)", "HSL": "(0.33, 1.00, 0.50)"}', "14"],
 7        ["e", '{"color": "Blue",   "wavelength": "450 to 495 nm", "RGB": "(0,0,255)", "CMYK": "(1.00, 1.00, 0.00, 0.00)"}', "15"],
 8        ["f", '{"color": "Indigo", "wavelength": "425 to 450 nm", "RGB": "(75,0,130)", "CMYK": "(0.42, 1.00, 0.00, 0.49)"}', "16"],
 9        ["g", '{"color": "Violet", "wavelength": "380 to 450 nm", "RGB": "(238,130,238)", "CMYK": "(0.00, 0.45, 0.00, 0.07)", "HSL": "(0.83, 0.76, 0.72)"}', "17"],
10    ],
11    index=["row 1", "row 2", "row 3", "row 4", "row 5", "row 6", "row 7"],
12    columns=["one", "two", "three"],
13)

Dataframe with inconsistent JSON objects in a column

one two three
row 1 a {
    "color": "Red",
    "wavelength": "620 to 750 nm",
    "RGB": "(255,0,0)",
    "CMYK": "(0.00, 1.00, 1.00, 0.00)"
}
11
row 2 b {
    "color": "Orange",
    "wavelength": "590 to 620 nm",
    "CMYK": "(0.00, 0.35, 1.00, 0.00)"
}
12
row 3 c {
    "color": "Yellow",
    "wavelength": "570 to 590 nm",
    "RGB": "(255,255,0)",
    "HSL": "(0.17, 1.00, 0.50)"
}
13
row 4 d {
    "color": "Green",
    "wavelength": "495 to 570 nm",
    "RGB": "(0,250,0)",
    "HSL": "(0.33, 1.00, 0.50)"
}
14
row 5 e {
    "color": "Blue",
    "wavelength": "450 to 495 nm",
    "RGB": "(0,0,255)",
    "CMYK": "(1.00, 1.00, 0.00, 0.00)"
}
15
row 6 f {
    "color": "Indigo",
    "wavelength": "425 to 450 nm",
    "RGB": "(75,0,130)",
    "CMYK": "(0.42, 1.00, 0.00, 0.49)"
}
16
row 7 g {
    "color": "Violet",
    "wavelength": "380 to 450 nm",
    "RGB": "(238,130,238)",
    "CMYK": "(0.00, 0.45, 0.00, 0.07)",
    "HSL": "(0.83, 0.76, 0.72)"
}
17
1df1.join(
2    df1['two'].apply(lambda x: pd.Series(json.loads(x)))
3)

All JSON fields extracted to columns and joined to original Dataframe

one two three color wavelength RGB CMYK HSL
row 1 a {
    "color": "Red",
    "wavelength": "620 to 750 nm",
    "RGB": "(255,0,0)",
    "CMYK": "(0.00, 1.00, 1.00, 0.00)"
}
11 Red 620 to 750 nm (255,0,0) (0.00, 1.00, 1.00, 0.00) nan
row 2 b {
    "color": "Orange",
    "wavelength": "590 to 620 nm",
    "CMYK": "(0.00, 0.35, 1.00, 0.00)"
}
12 Orange 590 to 620 nm nan (0.00, 0.35, 1.00, 0.00) nan
row 3 c {
    "color": "Yellow",
    "wavelength": "570 to 590 nm",
    "RGB": "(255,255,0)",
    "HSL": "(0.17, 1.00, 0.50)"
}
13 Yellow 570 to 590 nm (255,255,0) nan (0.17, 1.00, 0.50)
row 4 d {
    "color": "Green",
    "wavelength": "495 to 570 nm",
    "RGB": "(0,250,0)",
    "HSL": "(0.33, 1.00, 0.50)"
}
14 Green 495 to 570 nm (0,250,0) nan (0.33, 1.00, 0.50)
row 5 e {
    "color": "Blue",
    "wavelength": "450 to 495 nm",
    "RGB": "(0,0,255)",
    "CMYK": "(1.00, 1.00, 0.00, 0.00)"
}
15 Blue 450 to 495 nm (0,0,255) (1.00, 1.00, 0.00, 0.00) nan
row 6 f {
    "color": "Indigo",
    "wavelength": "425 to 450 nm",
    "RGB": "(75,0,130)",
    "CMYK": "(0.42, 1.00, 0.00, 0.49)"
}
16 Indigo 425 to 450 nm (75,0,130) (0.42, 1.00, 0.00, 0.49) nan
row 7 g {
    "color": "Violet",
    "wavelength": "380 to 450 nm",
    "RGB": "(238,130,238)",
    "CMYK": "(0.00, 0.45, 0.00, 0.07)",
    "HSL": "(0.83, 0.76, 0.72)"
}
17 Violet 380 to 450 nm (238,130,238) (0.00, 0.45, 0.00, 0.07) (0.83, 0.76, 0.72)
1df1.join(
2    df1['two'].apply(lambda x: pd.Series(json.loads(x)))
3).drop(['two'], axis=1)

All JSON fields extracted to columns with original JSON column removed

one three color wavelength RGB CMYK HSL
row 1 a 11 Red 620 to 750 nm (255,0,0) (0.00, 1.00, 1.00, 0.00) nan
row 2 b 12 Orange 590 to 620 nm nan (0.00, 0.35, 1.00, 0.00) nan
row 3 c 13 Yellow 570 to 590 nm (255,255,0) nan (0.17, 1.00, 0.50)
row 4 d 14 Green 495 to 570 nm (0,250,0) nan (0.33, 1.00, 0.50)
row 5 e 15 Blue 450 to 495 nm (0,0,255) (1.00, 1.00, 0.00, 0.00) nan
row 6 f 16 Indigo 425 to 450 nm (75,0,130) (0.42, 1.00, 0.00, 0.49) nan
row 7 g 17 Violet 380 to 450 nm (238,130,238) (0.00, 0.45, 0.00, 0.07) (0.83, 0.76, 0.72)