CaseExplainer
=============

The main class for creating case-based explanations.

.. currentmodule:: case_explainer

.. autoclass:: CaseExplainer
   :members:
   :undoc-members:
   :show-inheritance:
   :special-members: __init__

Core Methods
------------

Building the Explainer
^^^^^^^^^^^^^^^^^^^^^^

.. automethod:: CaseExplainer.__init__

Explaining Predictions
^^^^^^^^^^^^^^^^^^^^^^

.. automethod:: CaseExplainer.explain_instance

.. automethod:: CaseExplainer.explain_batch

Example Usage
-------------

Basic Example
^^^^^^^^^^^^^

.. code-block:: python

    from case_explainer import CaseExplainer
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier

    # Load and split data
    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=0.3, random_state=42
    )

    # Train classifier
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)

    # Create explainer
    explainer = CaseExplainer(
        X_train=X_train,
        y_train=y_train,
        feature_names=data.feature_names,
        algorithm='ball_tree',
        scale_data=True
    )

    # Explain a prediction
    explanation = explainer.explain_instance(
        test_sample=X_test[0],
        k=5,
        model=clf,
        true_class=y_test[0]
    )

    print(f"Correspondence: {explanation.correspondence:.2%}")
    print(f"Predicted class: {explanation.predicted_class}")
    print(f"Correct: {explanation.is_correct()}")

Batch Explanations
^^^^^^^^^^^^^^^^^^

.. code-block:: python

    # Explain multiple predictions at once
    explanations = explainer.explain_batch(
        X_test[:100],
        k=5,
        y_test=y_test[:100],
        model=clf
    )

    def mean(values):
        # Return NaN instead of dividing by zero when a group is empty
        return sum(values) / len(values) if values else float("nan")

    # Analyze correspondence distribution
    correspondences = [exp.correspondence for exp in explanations]
    correct_corr = [exp.correspondence for exp in explanations if exp.is_correct()]
    incorrect_corr = [exp.correspondence for exp in explanations if not exp.is_correct()]

    print(f"Mean correspondence: {mean(correspondences):.2%}")
    print(f"Correct predictions: {mean(correct_corr):.2%}")
    print(f"Incorrect predictions: {mean(incorrect_corr):.2%}")

Working with Metadata
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    # Attach metadata to training samples
    metadata = {
        'sample_id': [f"patient_{i}" for i in range(len(X_train))],
        'date': ['2024-01-01'] * len(X_train),
        'source': ['hospital_A'] * len(X_train)
    }

    explainer = CaseExplainer(
        X_train=X_train,
        y_train=y_train,
        metadata=metadata,
        algorithm='ball_tree'
    )

    # Access metadata in explanations
    explanation = explainer.explain_instance(X_test[0], k=5, model=clf)
    for neighbor in explanation.neighbors:
        print(f"Neighbor {neighbor.index}: {neighbor.metadata}")

Configuration Options
---------------------

Algorithm Selection
^^^^^^^^^^^^^^^^^^^

Choose the indexing algorithm based on your data characteristics:

* **kd_tree**: Best for low-dimensional data (<20 features), fastest for small to medium datasets
* **ball_tree**: Better for high-dimensional data (>20 features), good all-around choice
* **brute**: Exhaustive (brute-force) search over all training samples, only recommended for small datasets (<5k samples)
* **auto**: Let scikit-learn choose based on data characteristics (default)

.. code-block:: python

    # For low-dimensional data
    explainer = CaseExplainer(X_train, y_train, algorithm='kd_tree')

    # For high-dimensional data
    explainer = CaseExplainer(X_train, y_train, algorithm='ball_tree')

    # For small datasets
    explainer = CaseExplainer(X_train, y_train, algorithm='brute')

Feature Scaling
^^^^^^^^^^^^^^^

Feature scaling is recommended to prevent features with large ranges from dominating distance calculations:

.. code-block:: python

    # With scaling (recommended)
    explainer = CaseExplainer(X_train, y_train, scale_data=True)

    # Without scaling (if features are already normalized)
    explainer = CaseExplainer(X_train, y_train, scale_data=False)

Class Weights
^^^^^^^^^^^^^

For imbalanced datasets, you can weight classes differently in the correspondence computation:

.. code-block:: python

    # Weight minority class more heavily
    explainer = CaseExplainer(
        X_train, y_train,
        class_weights={0: 1.0, 1: 5.0}  # Weight class 1 five times more
    )

Notes
-----

**Performance Considerations**

* Index building time is O(n log n) for tree-based methods
* Query time is typically O(log n) for tree-based methods on low-dimensional data, and O(n) for brute force
* Memory usage scales with dataset size and dimensionality
* Use ``n_jobs=-1`` to parallelize nearest neighbor search

**Correspondence Interpretation**

* **High (≥85%)**: Strong agreement with training precedent, high confidence
* **Medium (70-85%)**: Moderate agreement, reasonable confidence
* **Low (<70%)**: Weak agreement, prediction may be uncertain or unusual

See Also
--------

* :class:`Explanation`: The explanation object returned by ``explain_instance``
* :mod:`case_explainer.metrics`: Correspondence and distance metrics
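
Labeling Correspondence Bands
-----------------------------

The correspondence bands listed under Notes can be turned into a small labeling helper. The sketch below is illustrative only: ``correspondence_band`` is a hypothetical function, not part of ``case_explainer``. It assumes that correspondence is a fraction in the range 0 to 1 (as the ``:.2%`` formatting in the examples suggests) and that a score of exactly 85% belongs to the high band.

.. code-block:: python

    def correspondence_band(correspondence):
        """Map a correspondence score in [0, 1] to a band name.

        Hypothetical helper, not part of case_explainer. The thresholds
        follow the bands listed under Notes.
        """
        if not 0.0 <= correspondence <= 1.0:
            raise ValueError("correspondence must be a fraction between 0 and 1")
        if correspondence >= 0.85:
            return "high"
        if correspondence >= 0.70:
            return "medium"
        return "low"


    # Plain floats stand in for explanation.correspondence values
    for score in (0.92, 0.85, 0.78, 0.40):
        print(f"{score:.0%}: {correspondence_band(score)}")

With a fitted explainer, the same helper could be applied as ``correspondence_band(explanation.correspondence)``, provided the attribute holds a fraction as assumed here.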