Bug fix commit classifier only 60% accurate - 40% false positives from refactoring/formatting commits. How to improve?

Summary

The current approach to classifying GitHub commits as bug fixes is right only 60% of the time. The dataset consists of 28K commits, of which 1,200 were classified as bug fixes; 720 of those are correct and 480 (40%) are false positives from refactoring or formatting commits. Strictly speaking, this 60% is precision rather than accuracy, because the numbers say nothing about bug fixes the classifier missed. Reducing the false positives is the main goal.
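
As a quick check of the figures above, precision is the share of predicted bug fixes that are actually bug fixes. The variable names below are illustrative; only the three counts come from the dataset description.

```python
# Counts taken from the summary; variable names are illustrative.
predicted_bug_fixes = 1200
true_positives = 720
false_positives = 480

# The two outcomes must account for every predicted bug fix.
assert true_positives + false_positives == predicted_bug_fixes

precision = true_positives / predicted_bug_fixes
false_discovery_rate = false_positives / predicted_bug_fixes

print(f"precision: {precision:.0%}")
print(f"false discovery rate: {false_discovery_rate:.0%}")
```

Recall cannot be computed from these counts, since nothing is known about bug fixes among the commits that were not flagged.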

Root Cause

The false positives come from four gaps in the current approach:

Insufficient keyword filtering: The current list has only 58 refactoring keywords and misses terms such as “migrate”, “extract method”, and “prettier”.
Lack of file path filtering: Commits that touch only files outside production code, such as tests, documentation, and build files, are not filtered out.
No AST detection: The approach does not use Abstract Syntax Tree (AST) analysis to detect changes typical of bug fixes, such as added null checks, exception handling, and bounds checks.
Post-filtering: Filters run after the expensive analysis, so commits that a cheap check would have discarded are still analyzed in full.
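
The keyword and file path gaps can both be closed with cheap checks that run before any expensive analysis. The following is a minimal sketch under stated assumptions: the three keywords are the ones named above, while the directory names, file suffixes, and function names are hypothetical and would need tuning to the repositories being mined.

```python
import re
from pathlib import PurePosixPath

# Keywords named in the text; the existing 58-keyword list is not shown here.
REFACTOR_KEYWORDS = ("migrate", "extract method", "prettier")
# Leading word boundary only, so "migrated" and "migrates" also match.
KEYWORD_RE = re.compile(
    "|".join(r"\b" + re.escape(k) for k in REFACTOR_KEYWORDS),
    re.IGNORECASE,
)

# Assumed path conventions (hypothetical, adjust per project).
NON_PRODUCTION_DIRS = {"test", "tests", "docs", "doc", "build"}
NON_PRODUCTION_SUFFIXES = {".md", ".rst", ".txt", ".yml", ".yaml", ".toml"}


def is_non_production(path: str) -> bool:
    """Return True for paths that look like tests, docs, or build files."""
    p = PurePosixPath(path)
    in_excluded_dir = any(part.lower() in NON_PRODUCTION_DIRS for part in p.parts[:-1])
    return p.suffix.lower() in NON_PRODUCTION_SUFFIXES or in_excluded_dir


def passes_cheap_filters(message: str, changed_paths: list[str]) -> bool:
    """Return True if the commit should go on to the expensive analysis."""
    if KEYWORD_RE.search(message):
        return False
    # Drop commits that touch only non-production files.
    if changed_paths and all(is_non_production(p) for p in changed_paths):
        return False
    return True


print(passes_cheap_filters("Fix null pointer in parser", ["src/parser.py"]))
print(passes_cheap_filters("Run prettier on all files", ["src/app.js"]))
```

Rejecting on a keyword alone trades recall for precision: a message like "fix crash in migrate script" would be dropped too, so the keyword signal may be better used as one input among several than as a hard filter.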

Why This Happens in Real Systems

This issue occurs in real systems for the following reasons:

Complexity of code changes: Code changes can be complex and nuanced, making it challenging to accurately classify them as bug fixes or refactoring commits.
Limited training data: The training dataset may not be comprehensive enough to cover all possible scenarios, leading to overfitting or underfitting of the model.
Evolving codebase: The codebase is constantly evolving, with new features, bug fixes, and refactoring commits being added, which can affect the accuracy of the classifier.

Real-World Impact

The real-world impact of this issue is significant, as it can lead to:

Inaccurate metrics: Misclassified commits skew bug fix rates, refactoring rates, and code quality metrics.
Wasted resources: Commits wrongly flagged as bug fixes trigger unnecessary code reviews, testing, and debugging.
Delayed feedback: Developers get unreliable signals, which makes it harder to identify areas for improvement.

Example or Code

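import ast

def _is_none(node):
    return isinstance(node, ast.Constant) and node.value is None

def _is_len_call(node):
    return isinstance(node, ast.Call) and getattr(node.func, "id", None) == "len"

def is_bug_fix(message, new_source):
    # new_source: post-change Python source. A raw diff is not valid Python,
    # so ast.parse() would raise SyntaxError on it. message is unused here.
    # Caveat: this detects constructs present in the source, not ones the commit added.
    try:
        tree = ast.parse(new_source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Try):  # exception handling
            return True
        if isinstance(node, ast.Compare):
            operands = [node.left, *node.comparators]
            # null check (x is None, x == None) or bounds check (i < len(arr))
            if any(_is_none(o) or _is_len_call(o) for o in operands):
                return True
    return False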
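
A bug fix typically adds a guard rather than merely containing one, so one option is to count guard constructs in the file before and after the commit and keep only the difference. This is a sketch, not a definitive implementation: `guard_counts` and `added_guards` are hypothetical helper names, and both inputs are assumed to be complete, parseable Python source for the same file.

```python
import ast
from collections import Counter


def guard_counts(source: str) -> Counter:
    """Count try blocks and comparisons against None in Python source."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Try):
            counts["try"] += 1
        elif isinstance(node, ast.Compare):
            operands = [node.left, *node.comparators]
            if any(isinstance(o, ast.Constant) and o.value is None for o in operands):
                counts["none_check"] += 1
    return counts


def added_guards(before: str, after: str) -> Counter:
    """Return guards that appear more often in `after` than in `before`."""
    # Counter subtraction keeps only positive differences.
    return guard_counts(after) - guard_counts(before)


before = "def f(x):\n    return x.value\n"
after = "def f(x):\n    if x is None:\n        return 0\n    return x.value\n"
print(added_guards(before, after))
```

Comparing counts is coarse: a commit that moves a guard from one function to another shows no difference, so the result is best treated as a signal rather than a verdict.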

How Senior Engineers Fix It

Senior engineers can fix this issue by:

Improving keyword filtering: Using a more comprehensive set of refactoring keywords and filtering out non-code files.
Implementing AST analysis: Using AST analysis to detect specific code semantics, such as null checks, exception handling, and bounds checks.
Using machine learning techniques: Training a classifier on features such as code embeddings, combined with signals from static analysis, to improve accuracy beyond what keywords alone can achieve.
Multi-label classification: Using multi-label classification to handle mixed semantics, where a commit can be both a bug fix and a refactoring commit.
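
To illustrate the multi-label idea, a commit's result can be represented as a set of labels instead of a single class. The sketch below is rule-based and only shows the output representation, not a trained multi-label model; `CommitLabel`, `label_commit`, and the three boolean inputs are hypothetical names standing in for the outputs of earlier stages.

```python
from enum import Flag, auto


class CommitLabel(Flag):
    NONE = 0
    BUG_FIX = auto()
    REFACTORING = auto()
    FORMATTING = auto()


def label_commit(has_fix_signal: bool, has_refactor_keyword: bool,
                 whitespace_only: bool) -> CommitLabel:
    """Combine independent signals into a set of labels.

    The inputs are assumed to come from earlier stages such as AST
    analysis, keyword matching, and diff inspection.
    """
    labels = CommitLabel.NONE
    if has_fix_signal:
        labels |= CommitLabel.BUG_FIX
    if has_refactor_keyword:
        labels |= CommitLabel.REFACTORING
    if whitespace_only:
        labels |= CommitLabel.FORMATTING
    return labels


# A commit that both fixes a bug and refactors keeps both labels.
mixed = label_commit(True, True, False)
print(CommitLabel.BUG_FIX in mixed, CommitLabel.REFACTORING in mixed)
```

With this representation, a refactoring keyword no longer forces a choice between discarding a real bug fix and counting a pure refactoring as one.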

Why Juniors Miss It

Juniors may miss this issue due to:

Lack of experience: Little hands-on practice with code analysis and machine learning techniques.
Insufficient knowledge: Unfamiliarity with code semantics and AST analysis.
Overreliance on simple solutions: Treating keyword filtering as sufficient without considering the complexity of code changes.
Inadequate testing: Not validating the classifier against a hand-labeled sample, so false positives go unnoticed.
