SurvivalAnalysis Class¶
Implementation of cure rate survival analysis, based on the work by Dr. Nemanja Kosovalic and Prof. Sandip Barui, both formerly at the University of South Alabama. The current implementation supports options to use the Hard EM algorithm and the SCAR (selected completely at random) assumption, among others, for estimating cure probabilities.
Primary Author: Nem Kosovalic (nem.kosovalic@aimpointdigital.com)
Secondary Author: Yash Puranik (yash.puranik@aimpointdigital.com)
Company: Aimpoint Digital, LP
- class survival_analysis.SurvivalAnalysis(estimator='hard_EM', classifier=None, random_state=None)[source]¶
Bases:
object
SurvivalAnalysis class
This class implements methods for survival analysis, i.e. time-to-event analysis on a dataset in which the outcomes for a fraction of the points are unknown (censored), and in which some individuals never experience the event. For example, a medical trial where time since diagnosis is measured and some participants drop out during the trial period, some of whom are cured; or a manufacturing dataset where time-to-failure since maintenance is measured and some equipment takes so long to fail that it “never sees” the event.
For every training point (a patient in a medical trial, or machine in a manufacturing dataset), we have three potential labels:
Cured: The event never occurs.
Non-cured: The event occurs at some (possibly unobserved) time.
Censored: The event is not observed during the time period; it may or may not occur later.
The methods in this class help estimate the labels (Cured/Non-cured) for each of the points in the censored group. Using the estimated labels, a classifier is built to predict whether a training point will belong to the cured group or the non-cured group.
In addition, the methods in the class help predict the overall survival probability of any training point up to time t, as well as the risk for a training point at time t.
- Parameters
estimator ({'hard_EM', 'clustering', 'fifty_fifty', 'all_cens_cured'}, default='hard_EM') – The method used for estimating labels for censored training data. If ‘hard_EM’, the hard expectation maximization algorithm is utilized to estimate labels for the censored population. If ‘clustering’, a clustering model is built with a single cluster for the non-censored rows and two clusters for the censored rows; by comparing the distance of the two censored clusters to the non-censored cluster center, cure labels are assigned to the censored rows. If ‘fifty_fifty’, each censored point is assigned to the cured/non-cured group with equal probability. Finally, ‘all_cens_cured’ assumes that the entire censored population is cured.
classifier (A classifier object, default=None) – A fully initialized classification model object that follows the scikit-learn classification API. The model is used to classify between cured and non-cured labels. If no classifier input is provided, a logistic regression model with default parameters will be utilized.
random_state (int, array_like, SeedSequence, BitGenerator, Generator or None, default=None) – Random state for reproducible random number generation in the algorithms.
- estimator_¶
Method used for estimating censored data
- Type
str
- classifier_¶
A fully initialized classification object with scikit-learn classification API
- Type
A classification object
- scale_¶
Scale parameter for the fitted Weibull distribution
- Type
float
- shape_¶
Shape parameter for the fitted Weibull distribution
- Type
float
- gamma_¶
Gamma parameter for survival function
- Type
np.array
References
Kosovalić, N., Barui, S. A Hard EM algorithm for prediction of the cured fraction in survival data. Comput Stat (2021). https://doi.org/10.1007/s00180-021-01140-0
- cindex(test_times, test_labels, danger)[source]¶
Returns the concordance index on test/score data, i.e. the ratio of concordant pairs to comparable pairs.
- Parameters
test_times ({array-like} of shape (n_samples, 1)) – Column of times of event/times of censoring.
test_labels ({array-like} of shape (n_samples, 1)) – Column indicating whether the point was censored or the event was observed
danger (float) – Measures the risk of an individual; the higher the value, the greater the risk. Sometimes taken as the natural log of the proportionality factor (the exponential term) of a proportional hazards function. In this setting, a risk measure that also incorporates the (non-)cure probability is reasonable, e.g. ln(1-pi) + gamma*x or something similar.
- Returns
c-index – Concordance index
- Return type
float
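The pairing rules behind the concordance index can be illustrated with a self-contained sketch. This follows the standard Harrell definition (an assumption about the implementation, which is not shown here); `event_observed` uses 1 for an observed event and 0 for censored:

```python
import numpy as np

def concordance_index(times, event_observed, danger):
    """Ratio of concordant pairs to comparable pairs.

    A pair (i, j) is comparable when the earlier time belongs to an
    observed event; it is concordant when the individual with the
    earlier event time also has the higher danger score.
    """
    times = np.asarray(times, dtype=float)
    event_observed = np.asarray(event_observed, dtype=bool)
    danger = np.asarray(danger, dtype=float)

    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not event_observed[i]:
            continue  # only an observed event can anchor a comparable pair
        for j in range(len(times)):
            if i == j or times[j] <= times[i]:
                continue
            comparable += 1
            if danger[i] > danger[j]:
                concordant += 1.0
            elif danger[i] == danger[j]:
                concordant += 0.5  # ties in risk count as half
    return concordant / comparable
```

With a perfectly risk-ordered dataset the index is 1.0; random scores give roughly 0.5.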
- get_censor_label()[source]¶
Return the value of the label used to denote ‘censored’ rows
- Returns
censor_label – Integer value to be used to denote censored training points
- Return type
int
- get_covariate_importance()[source]¶
Returns the relative weights of the logistic regression for each covariate in the data set
- Returns
factors – Relative weight of each factor in the dataset
- Return type
{array-like} of shape (n_features, 1)
- get_cure_label()[source]¶
Return the value of the label used to denote ‘cured’ rows
- Returns
cure_label – Integer value to be used to denote cured training points
- Return type
int
- get_non_cure_label()[source]¶
Return the value of the label used to denote ‘non-cured’ rows
- Returns
non_cure_label – Integer value to be used to denote non-cured training points
- Return type
int
- get_params(deep=True)[source]¶
Get parameters for the estimator
- Parameters
deep (bool, default=True) – If True, will return the parameters for the SurvivalAnalysis class and contained subobjects
- Returns
params – Parameter names for class mapped to values
- Return type
dict
- get_risk_factors()[source]¶
Returns the relative risk factors associated with every covariate in the data set
- Returns
factors – Relative weight of each factor in the dataset
- Return type
{array-like} of shape (n_features, 1)
- predict_cure_labels(test_data, test_labels=None)[source]¶
Predict cured/non-cured labels for data
- Parameters
test_data ({array-like} of shape (n_samples, n_features)) – Test data
test_labels ({array-like} of shape (n_samples, 1), default=None) – Test labels indicating censored/non censored status for test data. Method will provide cure predictions for censored individuals
- Returns
predictions – Predicted cured/non_cured labels
- Return type
{array-like} of shape (n_samples, 1)
- predict_cure_proba(test_data, test_labels=None)[source]¶
Generate probability estimates of whether each point is cured or non-cured
- Parameters
test_data ({array-like} of shape (n_samples, n_features)) – Test data
test_labels ({array-like} of shape (n_samples, 1), default=None) – Test labels indicating censored/non censored status for test data. Method will provide cure predictions for censored individuals
- Returns
probabilities – Predicted cured, non_cured probabilities
- Return type
{array-like} shape (n_samples, 2)
- predict_danger(test_data, test_labels=None, weights=None)[source]¶
Generate a value for the danger, a measure of an individual’s risk of facing the event. The absolute value of the number is unimportant; when comparing the danger for two individuals, only the relative values matter.
- Parameters
test_data ({array-like} of shape (n_samples, n_features)) – Test data
test_labels ({array-like} of shape (n_samples, 1), default=None) – Test labels indicating censored/non censored status for test data. Method will provide cure predictions for censored individuals
weights ({array-like} of shape (2, 1), default=None) – Normalizing weights for calculating danger. If None, weights of (0.5, 0.5) are used
- Returns
danger – Danger associated with every individual in the test set
- Return type
{array-like} shape (n_samples, 1)
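As a hedged illustration of the kind of danger score described above: the combination ln(1-pi) + gamma*x is the example suggested in the cindex documentation, but the function name, the weighting scheme, and treating `weights` as two normalizing coefficients are assumptions, not the class’s actual formula:

```python
import numpy as np

def danger_score(cure_proba, covariates, gamma, weights=(0.5, 0.5)):
    """Hypothetical danger score combining the non-cure probability with
    the proportional-hazards linear predictor:

        w0 * ln(1 - pi) + w1 * (gamma . x)

    cure_proba : probability pi of being cured, shape (n_samples,)
    covariates : shape (n_samples, n_features)
    gamma      : survival-function coefficients, shape (n_features,)
    """
    cure_proba = np.asarray(cure_proba, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    w0, w1 = weights
    # higher cure probability pushes ln(1 - pi) toward -inf, lowering danger
    return w0 * np.log(1.0 - cure_proba) + w1 * (covariates @ gamma)
```

Only the ordering of the resulting scores is meaningful, matching the note above that absolute values are unimportant.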
- predict_overall_survival(test_data, test_times, test_labels=None)[source]¶
Predict overall survival function for test data based on fitting survival function.
- Parameters
test_data ({array-like} of shape (n_samples, n_features)) – Test data
test_times ({array-like} of shape (n_samples, k)) – Times at which survival of an individual is to be determined. k can be greater than 1 when the survival of an individual at multiple time points is to be determined
test_labels ({array-like} of shape (n_samples, 1), default=None) – Test labels indicating censored/non censored status for test data. Method will provide cure predictions for censored individuals. This is only needed if model is fit with is_scar=True assumption
- Returns
predictions – Overall survival function of any susceptible or non-susceptible individual
- Return type
{array-like} of shape (n_samples, k)
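The class stores Weibull scale_ and shape_ attributes, so the overall survival it predicts is plausibly a mixture cure model. A minimal sketch under that assumption (the exact parametrization used by the implementation is not shown here):

```python
import numpy as np

def overall_survival(t, cure_proba, scale, shape):
    """Population survival under a Weibull mixture cure model:

        S_pop(t) = pi + (1 - pi) * exp(-(t / scale) ** shape)

    Cured individuals (probability pi) never experience the event, so
    the curve plateaus at pi instead of decaying to zero.
    """
    t = np.asarray(t, dtype=float)
    susceptible_survival = np.exp(-((t / scale) ** shape))
    return cure_proba + (1.0 - cure_proba) * susceptible_survival
```

At t = 0 the population survival is 1; as t grows it approaches the cure probability pi rather than 0, which is the defining feature of cure rate models.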
- pu_fit(training_data, training_labels, **kwargs)[source]¶
Fits a model using censored and non censored inputs to estimate cured/non-cured labels
- Parameters
training_data ({array-like} of shape (n_samples, n_features)) – Training data
training_labels ({array-like} of shape (n_samples, 1)) – Labels for the training data. A value of _censor_label_ implies the training point was censored; _non_censor_label_ implies non-censored. The _censor_label_ and _non_censor_label_ values can be obtained with the get_censor_label and get_non_censor_label methods
Optional kwargs include:
pu_reg_term (float, default=0.5) – The strength of the quadratic regularization term (C*w^2) on the non-intercept model covariate weights for PU learning
pu_initialize ({'censoring_rate', 'use_clustering', 'use_random'}, default='use_random') – Method to determine how initial guesses for the SLSQP minimization are generated. The covariate weights are initialized at random from a uniform distribution. If ‘censoring_rate’ is selected, cure labels are initialized assuming the probability of being cured equals the censoring rate. If ‘use_clustering’ is selected, a single cluster is created from the non-censored rows and two clusters are created from the censored rows; the censored cluster closest to the non-censored rows is assigned the non_cured label as an initial guess. If ‘use_random’ is selected, multiple random guesses for the unknown cure labels are generated, multiple minimization problems are solved, and the output corresponding to the lowest objective is chosen
pu_max_guesses (int, default=50) – Maximum number of local searches to launch when the initialization method is ‘use_random’; otherwise ignored
pu_max_processes (int, default=1) – Maximum number of parallel local searches to launch when the initialization method is ‘use_random’; otherwise ignored. If -1, all available processors are utilized
pu_max_iter (int, default=1000) – Maximum number of iterations for the SLSQP method
pu_weight_lo ({array-like} of shape (n_features, 1), default=-0.5) – Lower bounds on weights for sampling from uniform distribution for initial value guesses
pu_weight_hi ({array-like} of shape (n_features, 1), default=0.5) – Upper bounds on weights for sampling from uniform distribution for initial value guesses
pu_kmeans_init (int) – Number of k-means initializations to try when estimator is ‘clustering’
is_scar ({True, False}, default=False) – True if SCAR (selected completely at random) assumption holds for the dataset. In this situation, we find the probability of being NOT censored given covariates. The latter is then divided by an appropriate constant to get probability of being NOT cured. See the paper https://cseweb.ucsd.edu/~elkan/posonly.pdf for more details
- Returns
Fitted estimator.
- Return type
self
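The SCAR adjustment referenced above, which divides P(not censored | x) by a constant following Elkan & Noto, can be sketched in isolation. The function name and the use of pre-computed classifier scores are illustrative assumptions:

```python
import numpy as np

def scar_adjust(scores, labelled_mask):
    """Given classifier scores s(x) approximating P(not censored | x),
    recover P(not cured | x) by dividing by the labelling constant
    c = E[s(x) | labelled], per the Elkan & Noto SCAR argument.
    """
    scores = np.asarray(scores, dtype=float)
    labelled_mask = np.asarray(labelled_mask, dtype=bool)
    c = scores[labelled_mask].mean()  # estimate of P(labelled | positive)
    return np.clip(scores / c, 0.0, 1.0)
```

Under SCAR, non-censored (labelled) examples are a uniform random sample of the non-cured population, which is what justifies a single constant c.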
- score_labels(test_data, test_labels, sample_weight=None)[source]¶
Returns the accuracy score on the given test data and labels.
- Parameters
test_data ({array-like} of shape (n_samples, n_features)) – Test data
test_labels ({array-like} of shape (n_samples, 1)) – Labels for the test data
sample_weight (array_like of shape (n_samples,), default=None) – Sample weights
- Returns
score – Mean accuracy
- Return type
float
- set_params(**params)[source]¶
Set the parameters of this SurvivalAnalysis object.
- Parameters
**params (dict) – Estimator parameters
- Returns
self
- Return type
SurvivalAnalysis object
- stochastic_fit(training_data, training_labels, training_times, **kwargs)[source]¶
Fits the lifetime parameters and survival function and returns a fitted object. This is a meta-heuristic for large datasets on which survival_fit does not scale well, due to the computational bottleneck posed by censored individuals.
Step 1: Split the data into smaller datasets.
Step 2: Run survival_fit on each dataset.
Step 3: Take the average of the fitted parameters as the final output.
- Parameters
training_data ({array-like} of shape (n_samples, n_features)) – Training data
training_labels ({array-like} of shape (n_samples, 1)) – Labels for training_data
training_times ({array-like} of shape (n_samples, 1)) – Times for all training points (censoring time or event time)
Optional kwargs include:
surv_reg_term (float, default=0.5) – Strength of the regularization term
surv_max_iter (int, default=1000) – Maximum number of iterations for the SLSQP method
surv_batch_size (int, default=200) – The maximum size for a batch for stochastic fit
is_scar ({True, False}, default=False) – True if SCAR (selected completely at random) assumption holds for the dataset. In this situation, we find the probability of being NOT censored given covariates. The latter is then divided by an appropriate constant to get probability of being NOT cured. See the paper https://cseweb.ucsd.edu/~elkan/posonly.pdf for more details
- Returns
Fitted estimator.
- Return type
self
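The three steps above can be sketched as follows. Here `fit_batch` stands in for a single survival_fit call, and the (scale, shape) tuple it returns is an assumption about what gets averaged:

```python
import numpy as np

def stochastic_fit_sketch(data, times, labels, fit_batch, batch_size=200, seed=0):
    """Meta-heuristic outline: shuffle the rows, split them into batches
    of at most batch_size, fit each batch independently, and average the
    (scale, shape) parameter tuples returned by fit_batch.
    """
    data, times, labels = map(np.asarray, (data, times, labels))
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(times))               # Step 1: shuffle and split
    n_batches = int(np.ceil(len(times) / batch_size))
    params = []
    for chunk in np.array_split(order, n_batches):
        # Step 2: fit each batch independently
        params.append(fit_batch(data[chunk], times[chunk], labels[chunk]))
    return tuple(np.mean(params, axis=0))             # Step 3: average parameters
```

Averaging independently fitted parameters trades some statistical efficiency for a runtime that scales with the batch size rather than the full dataset.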
- survival_fit(training_data, training_labels, training_times, **kwargs)[source]¶
Fits the lifetime parameters and survival function and returns a fitted object
- Parameters
training_data ({array-like} of shape (n_samples, n_features)) – Training data
training_labels ({array-like} of shape (n_samples, 1)) – Labels for training_data
training_times ({array-like} of shape (n_samples, 1)) – Times for all training points (censoring time or event time)
Optional kwargs include:
surv_reg_term (float, default=0.5) – The strength of the quadratic regularization term (C*w^2) on the non-intercept model covariate weights for survival fit
surv_max_iter (int, default=1000) – Maximum number of iterations for the SLSQP method
is_scar ({True, False}, default=False) – True if SCAR (selected completely at random) assumption holds for the dataset. In this situation, we find the probability of being NOT censored given covariates. The latter is then divided by an appropriate constant to get probability of being NOT cured. See the paper https://cseweb.ucsd.edu/~elkan/posonly.pdf for more details
- Returns
Fitted estimator.
- Return type
self