import numpy as np
import pandas as pd
import patsy as pt
import lifelines as lf
import matplotlib.pyplot as plt
import statsmodels.api as sm
Two analysis packages are used: mainly lifelines (Kaplan-Meier, Cox, AFT, …) and statsmodels (discrete-time logit estimation). The statsmodels package can also estimate Kaplan-Meier survival curves and Cox proportional hazards models. The lifelines package covers almost all the standard methods, with the exception of competing risks.
import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/Users/thevenin_m/AppData\Local/Continuum/anaconda3/Lib/site-packages/PyQt5/Qt5/plugins/platforms'
Loading the dataset
trans = pd.read_csv("https://raw.githubusercontent.com/mthevenin/analyse_duree/master/bases/transplantation.csv")
trans.head(10)
trans.info()
Package lifelines
https://lifelines.readthedocs.io/en/latest/
Non-parametric: Kaplan-Meier
KM estimator and median duration
T = trans['stime']
E = trans['died']

from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
kmf.fit(T, E)
print(kmf.survival_function_)
= "DUREE MEDIANE:"
a = kmf.median_survival_time_
b print(a,b)
          KM_estimate
timeline
0.0          1.000000
1.0          0.990291
2.0          0.961165
3.0          0.932039
5.0          0.912621
...               ...
1400.0       0.151912
1407.0       0.151912
1571.0       0.151912
1586.0       0.151912
1799.0       0.151912

[89 rows x 1 columns]
DUREE MEDIANE: 100.0
kmf.plot()
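To make explicit what KaplanMeierFitter computes, here is a hand-rolled sketch of the product-limit estimator on toy data (the km_estimate helper and the toy durations are illustrative, not part of the original example):

```python
import pandas as pd

def km_estimate(time, event):
    """Product-limit (Kaplan-Meier) estimator computed by hand:
    S(t) = prod over event times t_i <= t of (1 - d_i / n_i)."""
    df = pd.DataFrame({"time": time, "event": event}).sort_values("time")
    out, surv = [], 1.0
    for t, grp in df.groupby("time"):
        n_at_risk = (df["time"] >= t).sum()  # subjects still at risk just before t
        d = grp["event"].sum()               # events observed at t
        surv *= 1 - d / n_at_risk
        out.append((t, surv))
    return pd.DataFrame(out, columns=["time", "KM_estimate"])

# toy data: 6 durations, two censored observations (event=0)
km = km_estimate([1, 2, 2, 3, 4, 5], [1, 1, 0, 1, 1, 0])
```

Censored subjects leave the risk set without contributing an event, which is why the estimate only drops at event times.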
Comparing survival functions
ax = plt.subplot(111)
kmf = KaplanMeierFitter()
for name, grouped_df in trans.groupby('surgery'):
    kmf.fit(grouped_df['stime'], grouped_df['died'], label=name)
    kmf.plot(ax=ax)

from lifelines.statistics import multivariate_logrank_test
results = multivariate_logrank_test(trans['stime'], trans['surgery'], trans['died'])
results.print_summary()
<lifelines.StatisticalResult: multivariate_logrank_test>
               t_0 = -1
 null_distribution = chi squared
degrees_of_freedom = 1
         test_name = multivariate_logrank_test

---
 test_statistic    p  -log2(p)
           6.59 0.01      6.61
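The log-rank statistic above can be reproduced by hand: at each event time, compare observed events in one group to those expected under a common survival function. A minimal two-group sketch on toy data (the logrank_chi2 helper is illustrative, not the lifelines implementation):

```python
import pandas as pd

def logrank_chi2(time, event, group):
    """Two-group log-rank chi-square statistic computed by hand."""
    df = pd.DataFrame({"t": time, "e": event, "g": group})
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted(df.loc[df["e"] == 1, "t"].unique()):
        at_risk = df[df["t"] >= t]
        n, n1 = len(at_risk), (at_risk["g"] == 1).sum()
        d = ((df["t"] == t) & (df["e"] == 1)).sum()            # events at t
        d1 = ((df["t"] == t) & (df["e"] == 1) & (df["g"] == 1)).sum()
        obs_minus_exp += d1 - d * n1 / n                       # observed - expected
        if n > 1:                                              # hypergeometric variance
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp**2 / var

# toy data: group 1 fails early, group 0 late -> large statistic expected
chi2 = logrank_chi2([1, 2, 3, 4, 5, 6], [1] * 6, [1, 1, 1, 0, 0, 0])
```

Under the null hypothesis the statistic follows a chi-squared distribution with 1 degree of freedom, which is how the p-value above is obtained.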
Semi-parametric: Cox
Estimation
model = 'year + age + C(surgery) -1'
X = pt.dmatrix(model, trans, return_type='dataframe')
design_info = X.design_info
YX = X.join(trans[['stime','died']])
YX.drop(['C(surgery)[0]'], axis=1, inplace=True)
YX.head()
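The patsy step above amounts to dummy coding with one reference category then dropped by hand. An equivalent sketch with pandas alone (the toy data and the hypothetical column g are illustrative only):

```python
import pandas as pd

# toy data: a binary categorical g and a continuous x
toy = pd.DataFrame({"g": [0, 1, 0], "x": [1.0, 2.0, 3.0]})

# full dummy coding, then drop the reference category,
# mirroring the removal of C(surgery)[0] above
X = pd.get_dummies(toy, columns=["g"], prefix="g")
X = X.drop(columns=["g_0"])
```

The `-1` in the patsy formula removes the intercept, which is why patsy emits both dummies and one must be dropped manually.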
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(YX, duration_col='stime', event_col='died')
cph.print_summary()
cph.plot()
<lifelines.CoxPHFitter: fitted with 103 total observations, 28 right-censored observations>
             duration col = 'stime'
                event col = 'died'
      baseline estimation = breslow
   number of observations = 103
number of events observed = 75
   partial log-likelihood = -289.31
         time fit was run = 2021-04-21 13:24:52 UTC

---
               coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95%
C(surgery)[1] -0.99      0.37     0.44          -1.84          -0.13                0.16                0.88
year          -0.12      0.89     0.07          -0.25           0.01                0.78                1.01
age            0.03      1.03     0.01           0.00           0.06                1.00                1.06

                  z    p  -log2(p)
C(surgery)[1] -2.26 0.02      5.40
year          -1.78 0.08      3.72
age            2.19 0.03      5.12
---
Concordance = 0.65
Partial AIC = 584.61
log-likelihood ratio test = 17.63 on 3 df
-log2(p) of ll-ratio test = 10.90
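To read the table: exp(coef) is the hazard ratio. A quick check by hand for the surgery coefficient reported above:

```python
import math

# hazard ratio for C(surgery)[1]: exp(-0.99) ~ 0.37, i.e. roughly a
# 63% lower instantaneous risk of death for patients with prior surgery
hr = math.exp(-0.99)
```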
PH assumption tests
PH test: Schoenfeld (method 1)
cph.check_assumptions(YX,p_value_threshold=0.05)
The ``p_value_threshold`` is set at 0.05. Even under the null hypothesis of no violations, some
covariates will be below the threshold by chance. This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.
With that in mind, it's best to use a combination of statistical tests and visual tests to determine
the most serious violations. Produce visual plots using ``check_assumptions(..., show_plots=True)``
and looking for non-constant lines. See link [A] below for a full example.
<lifelines.StatisticalResult: proportional_hazard_test>
null_distribution = chi squared
degrees_of_freedom = 1
test_name = proportional_hazard_test
---
test_statistic p -log2(p)
C(surgery)[1] km 4.01 0.05 4.47
rank 3.74 0.05 4.23
age km 1.18 0.28 1.86
rank 1.06 0.30 1.72
year km 2.07 0.15 2.73
rank 2.08 0.15 2.75
1. Variable 'C(surgery)[1]' failed the non-proportional test: p-value is 0.0452.
PH test: Schoenfeld (method 2)
from lifelines.statistics import proportional_hazard_test
zph = proportional_hazard_test(cph, YX, time_transform='all')
zph.print_summary()
<lifelines.StatisticalResult: proportional_hazard_test>
 null_distribution = chi squared
degrees_of_freedom = 1
         test_name = proportional_hazard_test

---
                        test_statistic    p  -log2(p)
C(surgery)[1] identity            5.54 0.02      5.75
              km                  4.01 0.05      4.47
              log                 3.69 0.05      4.19
              rank                3.74 0.05      4.23
age           identity            1.61 0.20      2.29
              km                  1.18 0.28      1.86
              log                 0.61 0.44      1.20
              rank                1.06 0.30      1.72
year          identity            0.80 0.37      1.43
              km                  2.07 0.15      2.73
              log                 1.34 0.25      2.02
              rank                2.08 0.15      2.75
PH test: interaction with time
from lifelines.utils import to_episodic_format
from lifelines import CoxTimeVaryingFitter
Transforming the YX dataset
long = to_episodic_format(YX, duration_col='stime', event_col='died')
Creating the interaction variable
long['surgery_t'] = long['C(surgery)[1]'] * long['stop']
Estimation
ctv = CoxTimeVaryingFitter()
ctv.fit(long,
        id_col='id',
        event_col='died',
        start_col='start',
        stop_col='stop')
ctv.print_summary(4)
<lifelines.CoxTimeVaryingFitter: fitted with 31938 periods, 103 subjects, 75 events>
             event col = 'died'
    number of subjects = 103
     number of periods = 31938
      number of events = 75
partial log-likelihood = -287.3290
      time fit was run = 2021-04-21 13:32:40 UTC

---
                 coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95%
C(surgery)[1] -1.7547    0.1730   0.6743        -3.0764        -0.4331              0.0461              0.6485
age            0.0289    1.0293   0.0134         0.0025         0.0552              1.0025              1.0568
year          -0.1231    0.8842   0.0668        -0.2541         0.0079              0.7756              1.0080
surgery_t      0.0022    1.0022   0.0011         0.0001         0.0044              1.0001              1.0044

                    z      p  -log2(p)
C(surgery)[1] -2.6022 0.0093    6.7542
age            2.1479 0.0317    4.9785
year          -1.8415 0.0656    3.9312
surgery_t      2.0239 0.0430    4.5402
---
Partial AIC = 582.6581
log-likelihood ratio test = 21.5846 on 4 df
-log2(p) of ll-ratio test = 12.0103
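One way to read the interaction: the log hazard ratio of surgery is now a linear function of time. A small check using the coefficients reported above:

```python
# log hazard ratio for surgery at time t: b_surgery + b_interaction * t
b_surgery, b_inter = -1.7547, 0.0022  # coefficients from the output above

# day at which the estimated surgery effect would change sign
t_sign_change = -b_surgery / b_inter
```

A significantly non-zero interaction coefficient (p = 0.0430 here) is evidence against the proportional hazards assumption for surgery, consistent with the Schoenfeld tests.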
Binary dynamic variable
Discrete-time model
Continuous adjustment of the duration
Logistic model estimated with the statsmodels package. The to_episodic_format function from lifelines is used to reshape the data. For the duration, we use here the variable mois (stime grouped into 30-day intervals).
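What to_episodic_format produces is a person-period table: one row per subject and time unit at risk, with the event recorded on the last row. A simplified hand-rolled sketch (the to_person_period helper and the toy data are illustrative only):

```python
import pandas as pd

def to_person_period(df, duration_col, event_col):
    """Expand each subject into one row per time unit at risk
    (a simplified version of what lifelines' to_episodic_format does)."""
    rows = []
    for i, r in df.iterrows():
        T = int(r[duration_col])
        for t in range(1, T + 1):
            rows.append({"id": i, "stop": t,
                         "died": int(r[event_col]) if t == T else 0})
    return pd.DataFrame(rows)

# toy data: one subject who dies at t=3, one censored at t=2
toy = pd.DataFrame({"dur": [3, 2], "died": [1, 0]})
long = to_person_period(toy, "dur", "died")
```

On this expanded table, a logistic regression of died on functions of stop estimates the discrete-time hazard.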
import statsmodels.api as sm           # python-style API
import statsmodels.formula.api as smf  # R-style formula API => what we will use here
Reshaping the data to long format
= pd.read_csv("D:/Marc/SMS/FORMATIONS/2019/analyse durees/a distribuer/transplantation.csv")
td td.drop(['id'], axis=1, inplace=True)
'dur'] = td['mois']
td[= to_episodic_format(td, duration_col='mois', event_col='died') td
Selecting the duration adjustment
td['t2'] = td['stop']**2
td['t3'] = td['stop']**3
fit1 = smf.glm(formula="died ~ stop", data=td, family=sm.families.Binomial()).fit()
fit2 = smf.glm(formula="died ~ stop + t2", data=td, family=sm.families.Binomial()).fit()
fit3 = smf.glm(formula="died ~ stop + t2 + t3", data=td, family=sm.families.Binomial()).fit()
Comparing the AICs
print("AIC pour ajustement t1")
print(fit1.aic)
print("AIC pour ajustement durée t1 + t2")
print(fit2.aic)
print("AIC pour ajustement durée t1 + t2 + t3")
print(fit3.aic)
AIC pour ajustement t1
512.1039235968562
AIC pour ajustement durée t1 + t2
508.1014573009212
AIC pour ajustement durée t1 + t2 + t3
506.1882809518765
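As a reminder, the criterion being compared is AIC = 2*k - 2*logL (lower is better), which penalizes each extra polynomial term. A trivial helper with illustrative values:

```python
def aic(llf, k):
    """Akaike information criterion: 2*k - 2*llf; lower means better fit."""
    return 2 * k - 2 * llf

# e.g. a hypothetical model with 4 parameters and log-likelihood -250.0
example = aic(-250.0, 4)
```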
Estimating the model
= smf.glm(formula= "died ~ stop + t2 + t3 + year + age + surgery", data=td, family=sm.families.Binomial()).fit()
tdfit tdfit.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: died No. Observations: 1164
Model: GLM Df Residuals: 1157
Model Family: Binomial Df Model: 6
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -240.20
Date: Wed, 21 Apr 2021 Deviance: 480.39
Time: 15:44:21 Pearson chi2: 1.30e+03
No. Iterations: 7
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 6.3097 5.201 1.213 0.225 -3.884 16.503
stop -0.2807 0.077 -3.635 0.000 -0.432 -0.129
t2 0.0096 0.005 2.083 0.037 0.001 0.019
t3 -0.0001 6.97e-05 -1.493 0.135 -0.000 3.26e-05
year -0.1263 0.072 -1.747 0.081 -0.268 0.015
age 0.0337 0.014 2.330 0.020 0.005 0.062
surgery -1.0050 0.447 -2.250 0.024 -1.880 -0.130
==============================================================================
"""
Discrete adjustment of the duration
Creating the intervals for the example (quartiles of the duration in months)
td['ct4'] = pd.qcut(td['stop'], [0, .25, .5, .75, 1.])
td['ct4'].value_counts(normalize=True)*100
td.ct4 = pd.Categorical(td.ct4)
td['ct4'] = td.ct4.cat.codes
(0.999, 4.0]    27.233677
(11.0, 23.0]    24.484536
(4.0, 11.0]     24.398625
(23.0, 61.0]    23.883162
Name: ct4, dtype: float64
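How pd.qcut and cat.codes combine, on a toy series (the data below are illustrative only): qcut bins values by empirical quantiles, and cat.codes then turns the interval labels into integers 0..3, as done for ct4 above.

```python
import pandas as pd

s = pd.Series(range(1, 9))                   # toy durations 1..8
q = pd.qcut(s, [0, .25, .5, .75, 1.])        # quartile intervals
codes = pd.Categorical(q).codes              # integer codes 0..3
```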
For each individual, only one observation per interval is kept.
td2 = td
td2['t'] = td2['ct4']
td2 = td2.sort_values(['id', 'stop'])
td2 = td2.groupby(['id','ct4']).last()
Estimation
= smf.glm(formula= "died ~ C(t) + year + age + surgery", data=td2, family=sm.families.Binomial()).fit()
td2fit td2fit.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: died No. Observations: 200
Model: GLM Df Residuals: 200
Model Family: Binomial Df Model: -1
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -112.26
Date: Wed, 21 Apr 2021 Deviance: 224.52
Time: 15:54:03 Pearson chi2: 229.
No. Iterations: 5
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 11.8018 6.617 1.784 0.074 -1.167 24.770
C(t)[T.1] -0.9078 0.408 -2.227 0.026 -1.707 -0.109
C(t)[T.2] -1.8451 0.587 -3.141 0.002 -2.996 -0.694
C(t)[T.3] -0.3224 0.578 -0.557 0.577 -1.456 0.811
year -0.1947 0.093 -2.104 0.035 -0.376 -0.013
age 0.0468 0.018 2.543 0.011 0.011 0.083
surgery -1.1025 0.503 -2.192 0.028 -2.088 -0.117
==============================================================================
"""
Parametric AFT models
from lifelines import WeibullAFTFitter, LogLogisticAFTFitter
Weibull
aftw = WeibullAFTFitter()
aftw.fit(YX, duration_col='stime', event_col='died')
aftw.print_summary()
<lifelines.WeibullAFTFitter: fitted with 103 total observations, 28 right-censored observations>
             duration col = 'stime'
                event col = 'died'
   number of observations = 103
number of events observed = 75
           log-likelihood = -488.17
         time fit was run = 2021-04-21 13:55:14 UTC

---
                        coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95%
lambda_ C(surgery)[1]   1.97      7.17     0.78           0.44           3.50                1.56               33.05
        year            0.16      1.18     0.12          -0.08           0.40                0.93                1.49
        age            -0.06      0.94     0.02          -0.11          -0.01                0.90                0.99
        _intercept     -3.02      0.05     8.73         -20.13          14.09                0.00            1.31e+06
rho_    _intercept     -0.59      0.56     0.09          -0.77          -0.41                0.46                0.67

                          z      p  -log2(p)
lambda_ C(surgery)[1]  2.53   0.01      6.45
        year           1.33   0.18      2.44
        age           -2.49   0.01      6.28
        _intercept    -0.35   0.73      0.46
rho_    _intercept    -6.33 <0.005     31.93
---
Concordance = 0.65
AIC = 986.34
log-likelihood ratio test = 18.87 on 3 df
-log2(p) of ll-ratio test = 11.75
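In an AFT model, exp(coef) is a time ratio rather than a hazard ratio: a positive coefficient lengthens the expected survival time. A quick check on the surgery coefficient above:

```python
import math

# prior surgery multiplies the expected survival time
# by exp(1.97) ~ 7.2 (coefficient from the output above)
time_ratio = math.exp(1.97)
```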
Log-logistic
aftl = LogLogisticAFTFitter()
aftl.fit(YX, duration_col='stime', event_col='died')
aftl.print_summary()
<lifelines.LogLogisticAFTFitter: fitted with 103 total observations, 28 right-censored observations>
             duration col = 'stime'
                event col = 'died'
   number of observations = 103
number of events observed = 75
           log-likelihood = -482.58
         time fit was run = 2021-04-21 13:55:58 UTC

---
                        coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95%
alpha_ C(surgery)[1]    2.27      9.72     0.69           0.92           3.63                2.51               37.68
       year             0.24      1.27     0.12           0.01           0.47                1.01                1.60
       age             -0.04      0.96     0.02          -0.08          -0.00                0.92                1.00
       _intercept     -10.41      0.00     8.34         -26.76           5.94                0.00              381.76
beta_  _intercept      -0.18      0.83     0.10          -0.37           0.01                0.69                1.01

                          z      p  -log2(p)
alpha_ C(surgery)[1]   3.29 <0.005      9.96
       year            2.05   0.04      4.65
       age            -2.01   0.04      4.49
       _intercept     -1.25   0.21      2.24
beta_  _intercept     -1.86   0.06      4.00
---
Concordance = 0.66
AIC = 975.16
log-likelihood ratio test = 21.69 on 3 df
-log2(p) of ll-ratio test = 13.69
Package statsmodels
https://www.statsmodels.org/dev/duration.html
The package can estimate Kaplan-Meier survival functions and Cox proportional hazards models.
Kaplan-Meier
km = sm.SurvfuncRight(trans["stime"], trans["died"])
km.summary()
      Surv prob  Surv prob SE  num at risk  num events
Time
1      0.990291      0.009661          103         1.0
2      0.961165      0.019037          102         3.0
3      0.932039      0.024799           99         3.0
5      0.912621      0.027825           96         2.0
6      0.893204      0.030432           94         2.0
...         ...           ...          ...         ...
852    0.250655      0.048731           14         1.0
979    0.227868      0.049341           11         1.0
995    0.205081      0.049390           10         1.0
1032   0.182295      0.048877            9         1.0
1386   0.151912      0.049277            6         1.0
Log-rank tests are available through the survdiff function (same name as in R). For plots, the code appears rather heavy and would deserve to be simplified (so it is not covered here).
Comparing S(t) with log-rank tests
Result: (test statistic, p-value)
Unweighted test
sm.duration.survdiff(trans.stime, trans.died, trans.surgery)
# (6.590012323234387, 0.010255246157888975)
Breslow
sm.duration.survdiff(trans.stime, trans.died, trans.surgery, weight_type='gb')
# (8.989753779902495, 0.0027149757927903417)
Tarone-Ware
sm.duration.survdiff(trans.stime, trans.died, trans.surgery, weight_type='tw')
# (8.462352726451392, 0.0036257256194570653)
Cox model
mod = smf.phreg("stime ~ year + age + surgery", trans, status='died', ties="efron")
rslt = mod.fit()
print(rslt.summary())
                  Results: PHReg
=============================================================
Model:                 PH Reg       Sample size:         103
Dependent variable:    stime        Num. events:          75
Ties:                  Efron
-------------------------------------------------------------
          log HR  log HR SE     HR       t    P>|t|  [0.025 0.975]
-------------------------------------------------------------
year     -0.1196     0.0673  0.8872 -1.7765  0.0757  0.7775 1.0124
age       0.0296     0.0135  1.0300  2.1872  0.0287  1.0031 1.0577
surgery  -0.9873     0.4363  0.3726 -2.2632  0.0236  0.1584 0.8761
=============================================================
Confidence intervals are for the hazard ratios
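The confidence intervals reported for the hazard ratios can be recovered from the log HR and its standard error. A check on the surgery row above:

```python
import math

# surgery row: log HR and its standard error, from the output above
b, se = -0.9873, 0.4363

hr = math.exp(b)                       # hazard ratio
lo = math.exp(b - 1.96 * se)           # lower bound of the 95% CI
hi = math.exp(b + 1.96 * se)           # upper bound of the 95% CI
```

Both lifelines and statsmodels thus agree on the surgery effect (HR around 0.37); small differences in the last digit come from rounding of the normal quantile.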