How to parse and explode nested JSON fields in a table?

Summary

The task is to parse and explode a nested JSON field in a table. Each row has an ExtInfo column that stores extension information as a JSON-formatted string. That string must be parsed to extract its field values, and its tags array must be exploded so that each tag gets its own row while the scalar fields (score, analyst, rating) are repeated on every row.

Root Cause

The ExtInfo column stores each JSON document as a plain string, so pandas treats it as opaque text: column operations cannot reach the fields inside it until the string is parsed. The parsed object also mixes scalar fields with a variable-length tags array, and that array has to be exploded into separate rows.
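To make the root cause concrete, here is a minimal sketch that uses only the standard library on one ExtInfo value copied from the sample data. The variable names (raw, parsed, scalars, rows) are illustrative and not part of the original problem.

```python
import json

# One ExtInfo value from the sample data: it is a plain str, not a dict.
raw = '{"tags": ["金融", "银行"], "score": 72, "analyst": "王五", "rating": "中性"}'

parsed = json.loads(raw)  # str -> dict
tags = parsed["tags"]     # nested JSON array -> Python list

# Everything except the array is a scalar that must be kept on every row.
scalars = {key: value for key, value in parsed.items() if key != "tags"}

# "Exploding" by hand: one output row per tag, scalars repeated on each row.
rows = [{"Tag": tag, **scalars} for tag in tags]
```

The pandas version in the example section does the same thing column-wise: parse first, then turn each array element into a row.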

Why This Happens in Real Systems

This pattern is common wherever semi-structured formats such as JSON or XML are used for data exchange and storage and then end up in a single table column. The nesting has to be flattened before the data can be filtered, grouped, or joined, and that gets harder to do efficiently as the dataset grows.

Real-World Impact

Handling the nested field incorrectly affects both the accuracy and the efficiency of downstream analysis. If not addressed properly, it can lead to:

  • Inaccurate results, because fields are extracted or exploded incorrectly
  • Longer processing time and higher resource usage
  • Data processing workflows that are difficult to scale

Example or Code

import pandas as pd
import json

# Sample data
data = {
    'StockCode': ['600000', '600036', '000001', '000002', '600519'],
    'StockName': ['浦发银行', '招商银行', '平安银行', '万科A', '贵州茅台'],
    'Date': ['2024.01.15', '2024.01.15', '2024.01.15', '2024.01.15', '2024.01.15'],
    'ExtInfo': [
        '{"tags": ["金融", "银行", "国企"], "score": 78, "analyst": "张三", "rating": "买入"}',
        '{"tags": ["金融", "银行", "蓝筹"], "score": 85, "analyst": "李四", "rating": "增持"}',
        '{"tags": ["金融", "银行"], "score": 72, "analyst": "王五", "rating": "中性"}',
        '{"tags": ["房地产", "白马股", "低估值", "周期"], "score": 65, "analyst": "赵六", "rating": "减持"}',
        '{"tags": ["消费", "白酒", "核心资产", "高端"], "score": 95, "analyst": "钱七", "rating": "强烈推荐"}'
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

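# Parse each JSON string into a dict
df['ExtInfo'] = df['ExtInfo'].apply(json.loads)

# Extract the tag list and the scalar fields into their own columns.
# Do not explode 'ExtInfo' itself: it holds dicts, and the lookups below need them intact.
df['Tag'] = df['ExtInfo'].apply(lambda x: x['tags'])
df['Score'] = df['ExtInfo'].apply(lambda x: x['score'])
df['Analyst'] = df['ExtInfo'].apply(lambda x: x['analyst'])
df['Rating'] = df['ExtInfo'].apply(lambda x: x['rating'])

# Explode the tag lists: one row per tag, scalar columns repeated on each row
df = df.explode('Tag', ignore_index=True)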

# Select relevant columns
df = df[['StockCode', 'StockName', 'Date', 'Tag', 'Score', 'Analyst', 'Rating']]
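As an alternative sketch, pandas can flatten the parsed dicts in one call with pd.json_normalize instead of one lambda per field. This is one possible approach, not necessarily the intended one. It uses a two-row subset of the sample data, and the names ext and flat are illustrative.

```python
import json
import pandas as pd

# Two rows taken from the sample data above.
df = pd.DataFrame({
    "StockCode": ["000001", "000002"],
    "ExtInfo": [
        '{"tags": ["金融", "银行"], "score": 72, "analyst": "王五", "rating": "中性"}',
        '{"tags": ["房地产", "白马股", "低估值", "周期"], "score": 65, "analyst": "赵六", "rating": "减持"}',
    ],
})

# Parse every JSON string, then flatten the dicts into one column per key.
# The "tags" column still holds Python lists after this step.
ext = pd.json_normalize(df["ExtInfo"].map(json.loads).tolist())
ext.index = df.index  # keep row alignment with the source frame

flat = (
    df.drop(columns="ExtInfo")
      .join(ext)
      .explode("tags")  # one row per tag, scalar columns repeated
      .rename(columns={"tags": "Tag", "score": "Score",
                       "analyst": "Analyst", "rating": "Rating"})
      .reset_index(drop=True)
)
```

This avoids repeating the key names in separate lambdas, at the cost of a rename step to match the expected column names.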

How Senior Engineers Fix It

Senior engineers fix this issue by:

  • Parsing the JSON string with a JSON library (json.loads here) before touching any of its fields
  • Extracting the tags array and each scalar field into separate columns
  • Exploding only the tags column, so each array element becomes a row and the scalar columns are repeated alongside it
  • Selecting the relevant columns to achieve the expected output
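The steps above can be wrapped in a reusable helper. The sketch below is hypothetical: the function names parse_ext_info and explode_ext_info, and the choice to keep a row with empty values when ExtInfo is missing or malformed, are assumptions rather than requirements from the original problem.

```python
import json
import pandas as pd

def parse_ext_info(raw):
    """Parse one ExtInfo value; return {} if it is missing or not a JSON object."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):  # None, NaN, or malformed JSON
        return {}
    return parsed if isinstance(parsed, dict) else {}

def explode_ext_info(df, column="ExtInfo"):
    """Return a new frame with one row per tag and the scalar fields repeated."""
    parsed = df[column].map(parse_ext_info)
    out = df.drop(columns=column).copy()
    # Assumption: a missing or empty tag list still yields one row with an empty Tag.
    out["Tag"] = parsed.map(lambda d: d.get("tags") or [None])
    for src, dst in [("score", "Score"), ("analyst", "Analyst"), ("rating", "Rating")]:
        out[dst] = parsed.map(lambda d, key=src: d.get(key))
    return out.explode("Tag").reset_index(drop=True)

sample = pd.DataFrame({
    "StockCode": ["000001", "999999"],
    "ExtInfo": [
        '{"tags": ["金融", "银行"], "score": 72, "analyst": "王五", "rating": "中性"}',
        "not valid json",  # made-up bad row to exercise the fallback
    ],
})
result = explode_ext_info(sample)
```

Whether bad rows should be kept, dropped, or raised as errors depends on the pipeline; this sketch keeps them so that no stock silently disappears from the output.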

Why Juniors Miss It

Juniors may miss this issue because of:

  • Limited experience with nested data structures such as JSON stored inside a table column
  • Unfamiliarity with data manipulation operations such as explode, including which column it should be applied to
  • Difficulty seeing which parts of the JSON are arrays (to be exploded) and which are scalars (to be preserved)
  • Uncertainty about how to apply lambda functions to extract individual values from parsed JSON