Machine Learning: Fisher-Irwin test with multilabels

Summary

The Fisher-Irwin test (also known as Fisher's exact test) is used for enrichment analysis, but composite labels make it easy to get wrong. When proteins are annotated with combined functional labels such as 'assemblyinfection', a protein that carries a property can be counted as lacking it, which biases the contingency table. To address this, create a mapping from each original label to its component properties and build the tables from those properties.

Root Cause

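The root cause is the presence of composite labels in the dataset. When a label such as 'assemblyinfection' is treated as its own category rather than as two properties, it leads to:

  • Incorrect categorization of proteins: a protein that carries a property is counted as lacking it
  • Reduced apparent enrichment: hits that carry the property are missing from that property's count
  • Biased contingency tables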
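
To make the bias concrete, the following minimal sketch builds the 2x2 table for the 'infection' property twice: once by exact string match on the raw label, and once after splitting the composite label into components. The proteins and labels mirror the sample dataset in this write-up; the `hits` set and the helper names are hypothetical assumptions standing in for whatever list is being tested.

```python
# Toy data: protein -> raw label (same values as the sample dataset in this write-up).
labels = {"A": "assembly", "B": "assemblyinfection", "C": "infection", "D": "assembly"}

# Hypothetical assumption: proteins A and B are the "hits" being tested for enrichment.
hits = {"A", "B"}

# Component properties of the one composite label.
components = {"assemblyinfection": {"assembly", "infection"}}


def table(prop, has_prop):
    """Return [[hit & prop, hit & not prop], [non-hit & prop, non-hit & not prop]]."""
    counts = [[0, 0], [0, 0]]
    for protein, label in labels.items():
        row = 0 if protein in hits else 1
        col = 0 if has_prop(label, prop) else 1
        counts[row][col] += 1
    return counts


def exact_match(label, prop):
    # Naive rule: the raw label must equal the property name.
    return label == prop


def component_match(label, prop):
    # Expanded rule: the property must be among the label's components.
    return prop in components.get(label, {label})


naive = table("infection", exact_match)
expanded = table("infection", component_match)
print(naive)
print(expanded)
```

With exact matching, protein B is counted as lacking 'infection', so the hit row contains no protein with the property. After expansion, B moves into the correct cell.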

Why This Happens in Real Systems

This issue occurs in real systems due to:

  • Incomplete annotation: Proteins may be annotated with combined functional labels instead of individual properties
  • Lack of standardization: Different datasets may use different labeling conventions
  • Complexity of biological systems: Proteins can have multiple functions, making it challenging to assign a single label

Real-World Impact

The impact of this issue can be significant, leading to:

  • Inaccurate results: Biased contingency tables can affect the outcome of the Fisher-Irwin test
  • Misinterpretation of data: Incorrect categorization of proteins can lead to incorrect conclusions
  • Wasted resources: Repeating experiments or analyses due to incorrect results can be costly

Example or Code

import pandas as pd

# Create a sample dataset
data = {
    'Protein': ['A', 'B', 'C', 'D'],
    'Label': ['assembly', 'assemblyinfection', 'infection', 'assembly']
}

df = pd.DataFrame(data)

# Create a mapping from original labels to component properties
label_mapping = {
    'assemblyinfection': ['assembly', 'infection']
}

# Expand labels into sets of properties
def expand_labels(label):
    if label in label_mapping:
        return set(label_mapping[label])
    else:
        return {label}

df['Properties'] = df['Label'].apply(expand_labels)

print(df)
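
Once every protein has a set of properties, each property gets its own 2x2 table and its own test. The sketch below computes the one-sided (enrichment) p-value from the hypergeometric distribution using only the standard library. The function name `fisher_enrichment_p` and the example counts are illustrative assumptions; a library routine such as `scipy.stats.fisher_exact` with `alternative='greater'` could be used instead.

```python
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher-Irwin p-value for the table [[a, b], [c, d]].

    a: hits with the property      b: hits without it
    c: non-hits with the property  d: non-hits without it
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    n_hits = a + b          # row margin: number of hits
    n_prop = a + c          # column margin: proteins with the property
    total = a + b + c + d   # all proteins
    denom = comb(total, n_hits)
    upper = min(n_hits, n_prop)
    # Sum the probabilities of tables at least as extreme as the observed one.
    tail = sum(
        comb(n_prop, k) * comb(total - n_prop, n_hits - k)
        for k in range(a, upper + 1)
    )
    return tail / denom


# Illustrative counts only, not taken from a real dataset.
p_value = fisher_enrichment_p(3, 1, 1, 3)
print(p_value)
```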

How Senior Engineers Fix It

Senior engineers address this issue by:

  • Creating a robust label mapping: Using a comprehensive and standardized mapping from original labels to component properties
  • Implementing a systematic approach: Expanding labels into sets of properties and computing contingency tables accordingly
  • Avoiding double counting: Using sets to represent properties and ensuring that each protein is counted exactly once in each property's contingency table
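
These three practices can be combined in one helper. The following is a minimal sketch under stated assumptions, not a definitive implementation: `LABEL_MAPPING`, `expand`, and `contingency_tables` are hypothetical names, and the `hits` set is assumed to come from the experiment under analysis.

```python
# Hypothetical mapping from composite labels to their component properties.
LABEL_MAPPING = {"assemblyinfection": frozenset({"assembly", "infection"})}


def expand(label):
    """Return the set of component properties for one raw label."""
    return LABEL_MAPPING.get(label, frozenset({label}))


def contingency_tables(labels, hits):
    """Build one 2x2 table per property from a protein -> label dict.

    Each table is [[hit & prop, hit & not prop],
                   [non-hit & prop, non-hit & not prop]].
    """
    props = {protein: expand(label) for protein, label in labels.items()}
    all_props = sorted(set().union(*props.values()))
    tables = {}
    for prop in all_props:
        counts = [[0, 0], [0, 0]]
        for protein, owned in props.items():
            row = 0 if protein in hits else 1
            col = 0 if prop in owned else 1
            counts[row][col] += 1
        # Every protein lands in exactly one cell, so nothing is double counted.
        assert sum(map(sum, counts)) == len(labels)
        tables[prop] = counts
    return tables


labels = {"A": "assembly", "B": "assemblyinfection", "C": "infection", "D": "assembly"}
tables = contingency_tables(labels, hits={"A", "B"})
print(tables)
```

A protein with a composite label contributes to the table of each of its properties, but within any single table it occupies exactly one cell, so the table total always equals the number of proteins.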

Why Juniors Miss It

Junior engineers may miss this issue due to:

  • Lack of experience: Limited exposure to complex datasets and labeling conventions
  • Insufficient understanding: Not fully grasping the implications of composite labels on contingency tables
  • Overlooking details: Failing to consider the potential for biased contingency tables and incorrect categorization of proteins