18  Python

Two analysis packages: mainly `lifelines` (Kaplan-Meier, Cox, AFT…) and `statsmodels` (discrete-time logit estimation, Kaplan-Meier, Cox).

The statsmodels package can also estimate Kaplan-Meier survival curves and Cox proportional-hazards models. The lifelines package covers almost all the standard methods, with the exception of competing risks.

RMST computation was recently added to the lifelines package; this update is included below.

# Remember to install the packages: pip install package_name

import numpy  as np
import pandas as pd
import patsy  as pt
import lifelines as lf
import matplotlib.pyplot as plt
import statsmodels.api as sm

Loading the data

trans = pd.read_csv("https://raw.githubusercontent.com/mthevenin/analyse_duree/master/bases/transplantation.csv")

trans.head(10)
trans.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   id          103 non-null    int64
 1   year        103 non-null    int64
 2   age         103 non-null    int64
 3   died        103 non-null    int64
 4   stime       103 non-null    int64
 5   surgery     103 non-null    int64
 6   transplant  103 non-null    int64
 7   wait        103 non-null    int64
 8   mois        103 non-null    int64
 9   compet      103 non-null    int64
dtypes: int64(10)
memory usage: 8.2 KB

18.1 Package lifelines

Documentation: https://lifelines.readthedocs.io/en/latest/

18.1.1 Non-parametric: Kaplan-Meier

18.1.1.1 Computing the estimators

KM estimator and median survival time

T = trans['stime']
E = trans['died']


from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
kmf.fit(T, E)
print(kmf.survival_function_)
print("MEDIAN DURATION:", kmf.median_survival_time_)
          KM_estimate
timeline             
0.0          1.000000
1.0          0.990291
2.0          0.961165
3.0          0.932039
5.0          0.912621
...               ...
1400.0       0.151912
1407.0       0.151912
1571.0       0.151912
1586.0       0.151912
1799.0       0.151912

[89 rows x 1 columns]
MEDIAN DURATION: 100.0
kmf.plot()

Comparing survival functions across groups

ax = plt.subplot(111)
kmf = KaplanMeierFitter()
for name, grouped_df in trans.groupby('surgery'):
    kmf.fit(grouped_df['stime'], grouped_df['died'], label=name)
    kmf.plot(ax=ax)

18.1.1.2 Log-rank tests

from lifelines.statistics import multivariate_logrank_test
results = multivariate_logrank_test(trans['stime'], trans['surgery'], trans['died'])
results.print_summary()
t_0                                 -1
null_distribution          chi squared
degrees_of_freedom                   1
test_name    multivariate_logrank_test

   test_statistic     p  -log2(p)
0            6.59  0.01      6.61

18.1.1.3 Computing the RMSTs

Note
  • Added in a 2024 update.
  • The required code is quite heavy.
  • No test for comparing RMSTs (difference or ratio).
  • Loading the function:
from lifelines.utils import restricted_mean_survival_time
  • Defining the value of the exposed group (here surgery equal to 1):
ix = trans['surgery'] == 1
  • Defining the maximum duration, here 1407 days (same as the R default):
tmax = 1407
  • Computing the RMSTs:
kmf_1 = KaplanMeierFitter().fit(T[ix], E[ix], label='Opérés')
rmst_1 = restricted_mean_survival_time(kmf_1, t=tmax)

kmf_0 = KaplanMeierFitter().fit(T[~ix], E[~ix], label='Non opérés')
rmst_0 = restricted_mean_survival_time(kmf_0, t=tmax)

RMST for surgery = 0

rmst_0
379.14763007035475

RMST for surgery = 1

rmst_1
884.5757575757575
  • RMST curves for tmax = 1407
from matplotlib import pyplot as plt
from lifelines.plotting import rmst_plot

ax = plt.subplot(311)
rmst_plot(kmf_1, t=tmax, ax=ax)

ax = plt.subplot(312)
rmst_plot(kmf_0, t=tmax, ax=ax)

ax = plt.subplot(313)
rmst_plot(kmf_1, model2=kmf_0, t=tmax, ax=ax)

18.1.2 Semi-parametric: the Cox model

18.1.2.1 Estimation

# Design matrix built with patsy; '-1' drops the intercept column
model = 'year + age + C(surgery) -1'
X = pt.dmatrix(model, trans, return_type='dataframe')
design_info = X.design_info
YX = X.join(trans[['stime','died']])
# Keep only the surgery=1 indicator (surgery=0 is the reference)
YX.drop(['C(surgery)[0]'], axis=1, inplace=True)
YX.head()


from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(YX, duration_col='stime', event_col='died')
cph.print_summary()
cph.plot()
model                       lifelines.CoxPHFitter
duration col                'stime'
event col                   'died'
baseline estimation         breslow
number of observations      103
number of events observed   75
partial log-likelihood      -289.31
time fit was run            2024-09-24 06:52:44 UTC

                coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%  cmp to      z     p  -log2(p)
C(surgery)[1]  -0.99       0.37      0.44           -1.84           -0.13                 0.16                 0.88    0.00  -2.26  0.02      5.40
year           -0.12       0.89      0.07           -0.25            0.01                 0.78                 1.01    0.00  -1.78  0.08      3.72
age             0.03       1.03      0.01            0.00            0.06                 1.00                 1.06    0.00   2.19  0.03      5.12

Concordance                 0.65
Partial AIC                 584.61
log-likelihood ratio test   17.63 on 3 df
-log2(p) of ll-ratio test   10.90

18.1.2.2 Testing the PH assumption

PH test: Schoenfeld residuals, method 1

cph.check_assumptions(YX, p_value_threshold=0.05)
The ``p_value_threshold`` is set at 0.05. Even under the null hypothesis of no violations, some
covariates will be below the threshold by chance. This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.

With that in mind, it's best to use a combination of statistical tests and visual tests to determine
the most serious violations. Produce visual plots using ``check_assumptions(..., show_plots=True)``
and looking for non-constant lines. See link [A] below for a full example.



1. Variable 'C(surgery)[1]' failed the non-proportional test: p-value is 0.0452.

   Advice: with so few unique values (only 2), you can include `strata=['C(surgery)[1]', ...]` in
the call in `.fit`. See documentation in link [E] below.

---
[A]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
[B]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Bin-variable-and-stratify-on-it
[C]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates
[D]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Modify-the-functional-form
[E]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Stratification
null_distribution      chi squared
degrees_of_freedom     1
model                  <lifelines.CoxPHFitter: fitted with 103 total ...
test_name              proportional_hazard_test

                     test_statistic     p  -log2(p)
C(surgery)[1]  km              4.01  0.05      4.47
               rank            3.74  0.05      4.23
age            km              1.18  0.28      1.86
               rank            1.06  0.30      1.72
year           km              2.07  0.15      2.73
               rank            2.08  0.15      2.75

[]

PH test: Schoenfeld residuals, method 2

from lifelines.statistics import proportional_hazard_test
zph = proportional_hazard_test(cph, YX, time_transform='all')
zph.print_summary()
null_distribution      chi squared
degrees_of_freedom     1
model                  <lifelines.CoxPHFitter: fitted with 103 total ...
test_name              proportional_hazard_test

                         test_statistic     p  -log2(p)
C(surgery)[1]  identity            5.54  0.02      5.75
               km                  4.01  0.05      4.47
               log                 3.69  0.05      4.19
               rank                3.74  0.05      4.23
age            identity            1.61  0.20      2.29
               km                  1.18  0.28      1.86
               log                 0.61  0.44      1.20
               rank                1.06  0.30      1.72
year           identity            0.80  0.37      1.43
               km                  2.07  0.15      2.73
               log                 1.34  0.25      2.02
               rank                2.08  0.15      2.75

PH test: interaction with time

from lifelines.utils import to_episodic_format
from lifelines import CoxTimeVaryingFitter

Transforming the YX dataset to long (episodic) format

long = to_episodic_format(YX, duration_col='stime', event_col='died')

Creating the interaction variable

long['surgery_t'] = long['C(surgery)[1]'] * long['stop']

Estimation

ctv = CoxTimeVaryingFitter()
ctv.fit(long,
        id_col='id',
        event_col='died',
        start_col='start',
        stop_col='stop',)
ctv.print_summary(4)
model                       lifelines.CoxTimeVaryingFitter
event col                   'died'
number of subjects          103
number of periods           31938
number of events            75
partial log-likelihood      -287.3290
time fit was run            2024-09-24 06:52:44 UTC

                  coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%  cmp to        z       p  -log2(p)
C(surgery)[1]  -1.7547     0.1730    0.6744         -3.0765         -0.4330               0.0461               0.6486  0.0000  -2.6020  0.0093    6.7533
age             0.0289     1.0293    0.0134          0.0025          0.0552               1.0025               1.0568  0.0000   2.1479  0.0317    4.9785
year           -0.1231     0.8842    0.0668         -0.2541          0.0079               0.7756               1.0080  0.0000  -1.8415  0.0656    3.9312
surgery_t       0.0022     1.0022    0.0011          0.0001          0.0044               1.0001               1.0044  0.0000   2.0239  0.0430    4.5400

Partial AIC                 582.6581
log-likelihood ratio test   21.5846 on 4 df
-log2(p) of ll-ratio test   12.0103

18.1.3 Discrete-time model

18.1.3.1 Continuous adjustment

Logistic model estimated with the statsmodels package. The to_episodic_format function from lifelines reshapes the data.
For the duration, we use here the mois variable (stime grouped into 30-day intervals).

import statsmodels.formula.api as smf  # R-style formula interface => used below
import statsmodels.api as sm           # array-based interface

Reshaping the data to long format

td = pd.read_csv("https://raw.githubusercontent.com/mthevenin/analyse_duree/master/bases/transplantation.csv")
td.drop(['id'], axis=1, inplace=True)
td['dur'] = td['mois']
td = to_episodic_format(td, duration_col='mois', event_col='died')

Assessing the fit with polynomial (quadratic and cubic) duration terms

td['t2'] = td['stop']**2
td['t3'] = td['stop']**3
fit1 = smf.glm(formula="died ~ stop", data=td, family=sm.families.Binomial()).fit()
fit2 = smf.glm(formula="died ~ stop + t2", data=td, family=sm.families.Binomial()).fit()
fit3 = smf.glm(formula="died ~ stop + t2 + t3", data=td, family=sm.families.Binomial()).fit()

Comparing the AICs

print("AIC for fit with t")
print(fit1.aic)
print("AIC for fit with t + t2")
print(fit2.aic)
print("AIC for fit with t + t2 + t3")
print(fit3.aic)
AIC for fit with t
504.5211512753311
AIC for fit with t + t2
492.11522432726747
AIC for fit with t + t2 + t3
486.50534103180416

Estimating the model

tdfit = smf.glm(formula="died ~ stop + t2 + t3 + year + age + surgery", data=td, family=sm.families.Binomial()).fit()
tdfit.summary()
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                   died   No. Observations:                 1127
Model:                            GLM   Df Residuals:                     1120
Model Family:                Binomial   Df Model:                            6
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -230.34
Date:                Tue, 24 Sep 2024   Deviance:                       460.67
Time:                        08:52:48   Pearson chi2:                 1.30e+03
No. Iterations:                     7   Pseudo R-squ. (CS):            0.07732
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.0827      5.308      1.334      0.182      -3.320      17.486
stop          -0.3721      0.082     -4.516      0.000      -0.534      -0.211
t2             0.0142      0.005      2.835      0.005       0.004       0.024
t3            -0.0002   7.85e-05     -2.113      0.035      -0.000    -1.2e-05
year          -0.1327      0.074     -1.798      0.072      -0.277       0.012
age            0.0333      0.015      2.270      0.023       0.005       0.062
surgery       -1.0109      0.449     -2.254      0.024      -1.890      -0.132
==============================================================================

18.1.3.2 Discrete adjustment

Creating the intervals for the example (quartiles of the duration in months)

td['ct4'] = pd.qcut(td['stop'],[0, .25, .5, .75, 1.]) 
td['ct4'].value_counts(normalize=True)*100
td.ct4 = pd.Categorical(td.ct4)
td['ct4'] = td.ct4.cat.codes

For each individual, only one observation per interval is kept.

td2 = td.copy()  # copy to avoid modifying td in place
td2['t'] = td2['ct4']
td2 = td2.sort_values(['id', 'stop'])
td2 = td2.groupby(['id','ct4']).last()

Estimation

td2fit = smf.glm(formula="died ~ C(t) + year + age + surgery", data=td2, family=sm.families.Binomial()).fit()
td2fit.summary()
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                   died   No. Observations:                  197
Model:                            GLM   Df Residuals:                      190
Model Family:                Binomial   Df Model:                            6
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -111.24
Date:                Tue, 24 Sep 2024   Deviance:                       222.48
Time:                        08:52:48   Pearson chi2:                     221.
No. Iterations:                     4   Pseudo R-squ. (CS):             0.1808
Covariance Type:            nonrobust
==============================================================================
                coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    12.4467      6.654      1.871      0.061      -0.594      25.488
C(t)[T.1]    -1.0334      0.419     -2.467      0.014      -1.854      -0.212
C(t)[T.2]    -1.6152      0.545     -2.965      0.003      -2.683      -0.547
C(t)[T.3]    -0.4789      0.599     -0.799      0.424      -1.654       0.696
year         -0.2032      0.093     -2.181      0.029      -0.386      -0.021
age           0.0469      0.018      2.533      0.011       0.011       0.083
surgery      -1.1102      0.503     -2.209      0.027      -2.095      -0.125
==============================================================================

18.1.4 Parametric AFT models

from lifelines import WeibullAFTFitter, LogLogisticAFTFitter

Weibull

aftw = WeibullAFTFitter()
aftw.fit(YX, duration_col='stime', event_col='died')
aftw.print_summary()
model                       lifelines.WeibullAFTFitter
duration col                'stime'
event col                   'died'
number of observations      103
number of events observed   75
log-likelihood              -488.17
time fit was run            2024-09-24 06:52:48 UTC

                        coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%  cmp to      z       p  -log2(p)
lambda_  C(surgery)[1]  1.97       7.17      0.78            0.44            3.50                 1.56                33.05    0.00   2.53    0.01      6.45
         age           -0.06       0.94      0.02           -0.11           -0.01                 0.90                 0.99    0.00  -2.49    0.01      6.28
         year           0.16       1.18      0.12           -0.08            0.40                 0.93                 1.49    0.00   1.33    0.18      2.44
         Intercept     -3.02       0.05      8.73          -20.13           14.09                 0.00             1.31e+06    0.00  -0.35    0.73      0.46
rho_     Intercept     -0.59       0.56      0.09           -0.77           -0.41                 0.46                 0.67    0.00  -6.33  <0.005     31.93

Concordance                 0.65
AIC                         986.34
log-likelihood ratio test   18.87 on 3 df
-log2(p) of ll-ratio test   11.75

Log-logistic

aftl = LogLogisticAFTFitter()
aftl.fit(YX, duration_col='stime', event_col='died')
aftl.print_summary()
model                       lifelines.LogLogisticAFTFitter
duration col                'stime'
event col                   'died'
number of observations      103
number of events observed   75
log-likelihood              -482.58
time fit was run            2024-09-24 06:52:48 UTC

                        coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%  cmp to      z       p  -log2(p)
alpha_   C(surgery)[1]  2.27       9.70      0.69            0.92            3.63                 2.50                37.56    0.00   3.29  <0.005      9.95
         age           -0.04       0.96      0.02           -0.08           -0.00                 0.92                 1.00    0.00  -2.01    0.04      4.48
         year           0.24       1.27      0.12            0.01            0.47                 1.01                 1.60    0.00   2.06    0.04      4.66
         Intercept    -10.43       0.00      8.34          -26.77            5.92                 0.00               372.19    0.00  -1.25    0.21      2.24
beta_    Intercept     -0.18       0.84      0.10           -0.37            0.01                 0.69                 1.01    0.00  -1.86    0.06      3.99

Concordance                 0.66
AIC                         975.16
log-likelihood ratio test   21.69 on 3 df
-log2(p) of ll-ratio test   13.69

18.2 Package statsmodels

https://www.statsmodels.org/dev/duration.html

The package can estimate Kaplan-Meier survival functions and Cox proportional-hazards models.

18.2.1 Kaplan-Meier

km = sm.SurvfuncRight(trans["stime"], trans["died"])
km.summary()
      Surv prob  Surv prob SE  num at risk  num events
Time
1      0.990291      0.009661          103         1.0
2      0.961165      0.019037          102         3.0
3      0.932039      0.024799           99         3.0
5      0.912621      0.027825           96         2.0
6      0.893204      0.030432           94         2.0
...         ...           ...          ...         ...
852    0.250655      0.048731           14         1.0
979    0.227868      0.049341           11         1.0
995    0.205081      0.049390           10         1.0
1032   0.182295      0.048877            9         1.0
1386   0.151912      0.049277            6         1.0

[62 rows x 4 columns]

Log-rank tests are available through the survdiff function (same name as in R). For graphics, the required code seems rather heavy and would deserve simplification, so it is not covered here.

Comparing S(t) with log-rank tests


Result: (test statistic, p-value)

Unweighted test

sm.duration.survdiff(trans.stime, trans.died, trans.surgery)
(6.5900123232343875, 0.010255246157888975)

Breslow

sm.duration.survdiff(trans.stime, trans.died, trans.surgery, weight_type='gb')
(8.989753779902493, 0.0027149757927903417)

Tarone-Ware

sm.duration.survdiff(trans.stime, trans.died, trans.surgery, weight_type='tw')
(8.462352726451392, 0.0036257256194570653)

18.2.2 Cox model

mod = smf.phreg("stime ~ year + age + surgery", trans, status='died', ties="efron")
rslt = mod.fit()
print(rslt.summary())
                       Results: PHReg
=============================================================
Model:                    PH Reg       Sample size:       103
Dependent variable:       stime        Num. events:       75 
Ties:                     Efron                              
-------------------------------------------------------------
         log HR log HR SE   HR      t    P>|t|  [0.025 0.975]
-------------------------------------------------------------
year    -0.1196    0.0673 0.8872 -1.7765 0.0757 0.7775 1.0124
age      0.0296    0.0135 1.0300  2.1872 0.0287 1.0031 1.0577
surgery -0.9873    0.4363 0.3726 -2.2632 0.0236 0.1584 0.8761
=============================================================
Confidence intervals are for the hazard ratios