import numpy as np
import pandas as pd
import patsy as pt
import lifelines as lf
import matplotlib.pyplot as plt
import statsmodels.api as sm
18 Python

- The following document is simply a program written 3–4 years ago. Since I use Python very little, I did not document the functions, some of them rather obscure, that were used.
- Execution used the `python3` Jupyter kernel.
- The document was not compiled to PDF; only this HTML version is available.

Two analysis packages are used: mainly `lifelines` (Kaplan-Meier, Cox, AFT…) and `statsmodels` (discrete-time logit estimation). The `statsmodels` package is also able to estimate Kaplan-Meier survival curves and Cox proportional hazards models. The `lifelines` package covers almost all standard methods, with the exception of competing risks.
Loading the data

```python
trans = pd.read_csv("https://raw.githubusercontent.com/mthevenin/analyse_duree/master/bases/transplantation.csv")
trans.head(10)
trans.info()
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 103 non-null int64
1 year 103 non-null int64
2 age 103 non-null int64
3 died 103 non-null int64
4 stime 103 non-null int64
5 surgery 103 non-null int64
6 transplant 103 non-null int64
7 wait 103 non-null int64
8 mois 103 non-null int64
9 compet 103 non-null int64
dtypes: int64(10)
memory usage: 8.2 KB
18.1 Package lifelines
Documentation: https://lifelines.readthedocs.io/en/latest/
18.1.1 Non-parametric: Kaplan-Meier

KM estimator and median duration

```python
from lifelines import KaplanMeierFitter

T = trans['stime']
E = trans['died']

kmf = KaplanMeierFitter()
kmf.fit(T, E)
print(kmf.survival_function_)

a = "DUREE MEDIANE:"
b = kmf.median_survival_time_
print(a, b)
```
KM_estimate
timeline
0.0 1.000000
1.0 0.990291
2.0 0.961165
3.0 0.932039
5.0 0.912621
... ...
1400.0 0.151912
1407.0 0.151912
1571.0 0.151912
1586.0 0.151912
1799.0 0.151912
[89 rows x 1 columns]
DUREE MEDIANE: 100.0
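As a sanity check on what `KaplanMeierFitter` computes, the product-limit estimator can be written in a few lines of plain Python. This is a minimal sketch on a toy sample (not the transplantation data): at each event time the survival curve is multiplied by (1 − deaths/at-risk).

```python
def kaplan_meier(T, E):
    """Product-limit estimator: S(t) = prod over event times of (1 - d_t / n_t)."""
    S, surv = 1.0, {}
    for t in sorted(set(T)):
        at_risk = sum(1 for u in T if u >= t)                       # n_t
        deaths = sum(1 for u, e in zip(T, E) if u == t and e == 1)  # d_t
        if deaths:
            S *= 1 - deaths / at_risk
            surv[t] = S
    return surv

# toy sample: events at t=1, 2, 3; censoring at t=2 and t=5
print(kaplan_meier([1, 2, 2, 3, 5], [1, 1, 0, 1, 0]))
```

Censored observations leave the curve flat but still shrink the risk set, which is why S(t) drops faster after a censoring time.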
```python
kmf.plot()
```
<AxesSubplot: xlabel='timeline'>
Comparing the survival functions

```python
ax = plt.subplot(111)
kmf = KaplanMeierFitter()
for name, grouped_df in trans.groupby('surgery'):
    kmf.fit(grouped_df['stime'], grouped_df['died'], label=name)
    kmf.plot(ax=ax)

from lifelines.statistics import multivariate_logrank_test
results = multivariate_logrank_test(trans['stime'], trans['surgery'], trans['died'])
results.print_summary()
```
t_0 | -1 |
null_distribution | chi squared |
degrees_of_freedom | 1 |
test_name | multivariate_logrank_test |

| test_statistic | p | -log2(p) |
|---|---|---|
| 6.59 | 0.01 | 6.61 |
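The `-log2(p)` column reported by lifelines is just a readability transform of the p-value (a value of k means p ≈ 2^-k, so larger is more significant). Recomputing it from the unrounded p-value of the test above:

```python
import math

p = 0.010255  # log-rank p-value before rounding
print(round(-math.log2(p), 2))  # 6.61, the value lifelines reports
```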
18.2 Semi-parametric: Cox model

18.2.1 Estimation

```python
model = 'year + age + C(surgery) -1'
X = pt.dmatrix(model, trans, return_type='dataframe')
design_info = X.design_info
YX = X.join(trans[['stime','died']])
YX.drop(['C(surgery)[0]'], axis=1, inplace=True)
YX.head()

from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(YX, duration_col='stime', event_col='died')
cph.print_summary()
cph.plot()
```
model | lifelines.CoxPHFitter |
duration col | 'stime' |
event col | 'died' |
baseline estimation | breslow |
number of observations | 103 |
number of events observed | 75 |
partial log-likelihood | -289.31 |
time fit was run | 2023-08-21 11:46:42 UTC |
coef | exp(coef) | se(coef) | coef lower 95% | coef upper 95% | exp(coef) lower 95% | exp(coef) upper 95% | cmp to | z | p | -log2(p) | |
---|---|---|---|---|---|---|---|---|---|---|---|
C(surgery)[1] | -0.99 | 0.37 | 0.44 | -1.84 | -0.13 | 0.16 | 0.88 | 0.00 | -2.26 | 0.02 | 5.40 |
year | -0.12 | 0.89 | 0.07 | -0.25 | 0.01 | 0.78 | 1.01 | 0.00 | -1.78 | 0.08 | 3.72 |
age | 0.03 | 1.03 | 0.01 | 0.00 | 0.06 | 1.00 | 1.06 | 0.00 | 2.19 | 0.03 | 5.12 |
Concordance | 0.65 |
Partial AIC | 584.61 |
log-likelihood ratio test | 17.63 on 3 df |
-log2(p) of ll-ratio test | 10.90 |
<AxesSubplot: xlabel='log(HR) (95% CI)'>
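The `exp(coef)` column is the hazard ratio. For example, for surgery (coef = −0.99 in the table above):

```python
import math

coef_surgery = -0.99       # Cox coefficient for C(surgery)[1]
hr = math.exp(coef_surgery)
print(round(hr, 2))  # 0.37: surgery divides the instantaneous death risk by ~2.7
```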
18.2.2 Testing the PH assumption

PH test: Schoenfeld residuals, method 1

```python
cph.check_assumptions(YX, p_value_threshold=0.05)
```
The ``p_value_threshold`` is set at 0.05. Even under the null hypothesis of no violations, some
covariates will be below the threshold by chance. This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.
With that in mind, it's best to use a combination of statistical tests and visual tests to determine
the most serious violations. Produce visual plots using ``check_assumptions(..., show_plots=True)``
and looking for non-constant lines. See link [A] below for a full example.
null_distribution | chi squared |
degrees_of_freedom | 1 |
model | <lifelines.CoxPHFitter: fitted with 103 total ... |
test_name | proportional_hazard_test |
test_statistic | p | -log2(p) | ||
---|---|---|---|---|
C(surgery)[1] | km | 4.01 | 0.05 | 4.47 |
rank | 3.74 | 0.05 | 4.23 | |
age | km | 1.18 | 0.28 | 1.86 |
rank | 1.06 | 0.30 | 1.72 | |
year | km | 2.07 | 0.15 | 2.73 |
rank | 2.08 | 0.15 | 2.75 |
1. Variable 'C(surgery)[1]' failed the non-proportional test: p-value is 0.0452.
Advice: with so few unique values (only 2), you can include `strata=['C(surgery)[1]', ...]` in
the call in `.fit`. See documentation in link [E] below.
---
[A] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
[B] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Bin-variable-and-stratify-on-it
[C] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates
[D] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Modify-the-functional-form
[E] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Stratification
PH test: Schoenfeld residuals, method 2

```python
from lifelines.statistics import proportional_hazard_test
zph = proportional_hazard_test(cph, YX, time_transform='all')
zph.print_summary()
```
null_distribution | chi squared |
degrees_of_freedom | 1 |
model | <lifelines.CoxPHFitter: fitted with 103 total ... |
test_name | proportional_hazard_test |
test_statistic | p | -log2(p) | ||
---|---|---|---|---|
C(surgery)[1] | identity | 5.54 | 0.02 | 5.75 |
km | 4.01 | 0.05 | 4.47 | |
log | 3.69 | 0.05 | 4.19 | |
rank | 3.74 | 0.05 | 4.23 | |
age | identity | 1.61 | 0.20 | 2.29 |
km | 1.18 | 0.28 | 1.86 | |
log | 0.61 | 0.44 | 1.20 | |
rank | 1.06 | 0.30 | 1.72 | |
year | identity | 0.80 | 0.37 | 1.43 |
km | 2.07 | 0.15 | 2.73 | |
log | 1.34 | 0.25 | 2.02 | |
rank | 2.08 | 0.15 | 2.75 |
PH test: interaction with time

```python
from lifelines.utils import to_episodic_format
from lifelines import CoxTimeVaryingFitter

# Reshape YX to long (episodic) format
long = to_episodic_format(YX, duration_col='stime', event_col='died')

# Create the time interaction variable
long['surgery_t'] = long['C(surgery)[1]'] * long['stop']

# Estimation
ctv = CoxTimeVaryingFitter()
ctv.fit(long,
        id_col='id',
        event_col='died',
        start_col='start',
        stop_col='stop')
ctv.print_summary(4)
```
model | lifelines.CoxTimeVaryingFitter |
event col | 'died' |
number of subjects | 103 |
number of periods | 31938 |
number of events | 75 |
partial log-likelihood | -287.3290 |
time fit was run | 2023-08-21 11:46:43 UTC |
coef | exp(coef) | se(coef) | coef lower 95% | coef upper 95% | exp(coef) lower 95% | exp(coef) upper 95% | cmp to | z | p | -log2(p) | |
---|---|---|---|---|---|---|---|---|---|---|---|
C(surgery)[1] | -1.7547 | 0.1730 | 0.6743 | -3.0764 | -0.4331 | 0.0461 | 0.6485 | 0.0000 | -2.6022 | 0.0093 | 6.7542 |
age | 0.0289 | 1.0293 | 0.0134 | 0.0025 | 0.0552 | 1.0025 | 1.0568 | 0.0000 | 2.1479 | 0.0317 | 4.9785 |
year | -0.1231 | 0.8842 | 0.0668 | -0.2541 | 0.0079 | 0.7756 | 1.0080 | 0.0000 | -1.8415 | 0.0656 | 3.9312 |
surgery_t | 0.0022 | 1.0022 | 0.0011 | 0.0001 | 0.0044 | 1.0001 | 1.0044 | 0.0000 | 2.0239 | 0.0430 | 4.5402 |
Partial AIC | 582.6581 |
log-likelihood ratio test | 21.5846 on 4 df |
-log2(p) of ll-ratio test | 12.0103 |
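With the interaction term, the hazard ratio of surgery is no longer constant: HR(t) = exp(−1.7547 + 0.0022·t). A quick computation with the coefficients above gives the effect at t = 0 and the time at which the ratio crosses 1:

```python
import math

b_surgery, b_interaction = -1.7547, 0.0022  # coefficients from the fit above

def hr(t):
    """Hazard ratio of surgery at time t (days)."""
    return math.exp(b_surgery + b_interaction * t)

print(round(hr(0), 3))                    # 0.173: strongly protective at t = 0
print(round(-b_surgery / b_interaction))  # 798: day at which HR(t) reaches 1
```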
18.2.3 Binary time-varying variable

18.3 Discrete-time model

18.3.1 Continuous adjustment of the duration

Logistic model estimated with the `statsmodels` package. The `to_episodic_format` function from `lifelines` puts the data into the required long format.

For the duration, we use here the variable mois (stime grouped into 30-day intervals).
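The long format needed by the discrete-time logit is the classic person-period expansion: each subject contributes one row per time unit at risk, with the event coded only on the last row. A minimal sketch of the idea behind `to_episodic_format` (simplified, hypothetical helper, not the lifelines implementation):

```python
def to_person_period(durations, events):
    """Expand each subject into one row per discrete time unit at risk."""
    rows = []
    for i, (d, e) in enumerate(zip(durations, events)):
        for t in range(1, d + 1):
            # event indicator is 1 only on the final period of an uncensored subject
            rows.append({'id': i, 'stop': t, 'died': int(bool(e) and t == d)})
    return rows

# subject 0: event at t=3; subject 1: censored at t=2
print(to_person_period([3, 2], [1, 0]))
```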
```python
import statsmodels.formula.api as smf  # R-style formula interface => what we will use
import statsmodels.api as sm           # Python-style interface

# Transform the data to long format
td = pd.read_csv("https://raw.githubusercontent.com/mthevenin/analyse_duree/master/bases/transplantation.csv")
td.drop(['id'], axis=1, inplace=True)
td['dur'] = td['mois']
td = to_episodic_format(td, duration_col='mois', event_col='died')
```
Assessing the duration adjustment with polynomial functions

```python
td['t2'] = td['stop']**2
td['t3'] = td['stop']**3

fit1 = smf.glm(formula="died ~ stop", data=td, family=sm.families.Binomial()).fit()
fit2 = smf.glm(formula="died ~ stop + t2", data=td, family=sm.families.Binomial()).fit()
fit3 = smf.glm(formula="died ~ stop + t2 + t3", data=td, family=sm.families.Binomial()).fit()
```
Comparing the AICs

```python
print("AIC for fit t1")
print(fit1.aic)
print("AIC for fit t1 + t2")
print(fit2.aic)
print("AIC for fit t1 + t2 + t3")
print(fit3.aic)
```

AIC for fit t1
504.5211512753311
AIC for fit t1 + t2
492.11522432726747
AIC for fit t1 + t2 + t3
486.50534103180416
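The AIC reported by statsmodels is the usual 2k − 2·logL, so each extra polynomial term costs a penalty of 2 and must improve the log-likelihood by more than 1 to lower the AIC. For instance, with the values of the full model estimated next (log-likelihood −230.34, 7 parameters):

```python
def aic(loglik, k):
    """Akaike information criterion: 2*k - 2*log-likelihood."""
    return 2 * k - 2 * loglik

print(round(aic(-230.34, 7), 2))  # 474.68
```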
Model estimation

```python
tdfit = smf.glm(formula="died ~ stop + t2 + t3 + year + age + surgery", data=td, family=sm.families.Binomial()).fit()
tdfit.summary()
```
Dep. Variable: | died | No. Observations: | 1127 |
Model: | GLM | Df Residuals: | 1120 |
Model Family: | Binomial | Df Model: | 6 |
Link Function: | Logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -230.34 |
Date: | Mon, 21 Aug 2023 | Deviance: | 460.67 |
Time: | 13:46:50 | Pearson chi2: | 1.30e+03 |
No. Iterations: | 7 | Pseudo R-squ. (CS): | 0.07732 |
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
Intercept | 7.0827 | 5.308 | 1.334 | 0.182 | -3.320 | 17.486 |
stop | -0.3721 | 0.082 | -4.516 | 0.000 | -0.534 | -0.211 |
t2 | 0.0142 | 0.005 | 2.835 | 0.005 | 0.004 | 0.024 |
t3 | -0.0002 | 7.85e-05 | -2.113 | 0.035 | -0.000 | -1.2e-05 |
year | -0.1327 | 0.074 | -1.798 | 0.072 | -0.277 | 0.012 |
age | 0.0333 | 0.015 | 2.270 | 0.023 | 0.005 | 0.062 |
surgery | -1.0109 | 0.449 | -2.254 | 0.024 | -1.890 | -0.132 |
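In the discrete-time logit, exp(coef) is an odds ratio, which approximates the Cox hazard ratio when events are rare within each interval. Comparing the surgery effect across the two models (coefficients taken from the tables above):

```python
import math

or_surgery = math.exp(-1.0109)  # discrete-time logit coefficient
hr_surgery = math.exp(-0.99)    # Cox coefficient from section 18.2
print(round(or_surgery, 3), round(hr_surgery, 3))  # 0.364 0.372
```

The two estimates are close, as expected with monthly intervals and a moderate event rate.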
18.3.2 Discrete adjustment of the duration

Creating the intervals for the example (quartiles of the duration in months)

```python
td['ct4'] = pd.qcut(td['stop'], [0, .25, .5, .75, 1.])
td['ct4'].value_counts(normalize=True)*100
td.ct4 = pd.Categorical(td.ct4)
td['ct4'] = td.ct4.cat.codes
```
For each individual, we keep a single observation per interval.

```python
td2 = td
td2['t'] = td2['ct4']
td2 = td2.sort_values(['id', 'stop'])
td2 = td2.groupby(['id','ct4']).last()
```
Estimation

```python
td2fit = smf.glm(formula="died ~ C(t) + year + age + surgery", data=td2, family=sm.families.Binomial()).fit()
td2fit.summary()
```
Dep. Variable: | died | No. Observations: | 197 |
Model: | GLM | Df Residuals: | 190 |
Model Family: | Binomial | Df Model: | 6 |
Link Function: | Logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -111.24 |
Date: | Mon, 21 Aug 2023 | Deviance: | 222.48 |
Time: | 13:46:50 | Pearson chi2: | 221. |
No. Iterations: | 4 | Pseudo R-squ. (CS): | 0.1808 |
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
Intercept | 12.4467 | 6.654 | 1.871 | 0.061 | -0.594 | 25.488 |
C(t)[T.1] | -1.0334 | 0.419 | -2.467 | 0.014 | -1.854 | -0.212 |
C(t)[T.2] | -1.6152 | 0.545 | -2.965 | 0.003 | -2.683 | -0.547 |
C(t)[T.3] | -0.4789 | 0.599 | -0.799 | 0.424 | -1.654 | 0.696 |
year | -0.2032 | 0.093 | -2.181 | 0.029 | -0.386 | -0.021 |
age | 0.0469 | 0.018 | 2.533 | 0.011 | 0.011 | 0.083 |
surgery | -1.1102 | 0.503 | -2.209 | 0.027 | -2.095 | -0.125 |
18.4 Parametric AFT models

```python
from lifelines import WeibullAFTFitter, LogLogisticAFTFitter
```

Weibull

```python
aftw = WeibullAFTFitter()
aftw.fit(YX, duration_col='stime', event_col='died')
aftw.print_summary()
```
model | lifelines.WeibullAFTFitter |
duration col | 'stime' |
event col | 'died' |
number of observations | 103 |
number of events observed | 75 |
log-likelihood | -488.17 |
time fit was run | 2023-08-21 11:46:50 UTC |
coef | exp(coef) | se(coef) | coef lower 95% | coef upper 95% | exp(coef) lower 95% | exp(coef) upper 95% | cmp to | z | p | -log2(p) | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
lambda_ | C(surgery)[1] | 1.97 | 7.17 | 0.78 | 0.44 | 3.50 | 1.56 | 33.05 | 0.00 | 2.53 | 0.01 | 6.45 |
age | -0.06 | 0.94 | 0.02 | -0.11 | -0.01 | 0.90 | 0.99 | 0.00 | -2.49 | 0.01 | 6.28 | |
year | 0.16 | 1.18 | 0.12 | -0.08 | 0.40 | 0.93 | 1.49 | 0.00 | 1.33 | 0.18 | 2.44 | |
Intercept | -3.02 | 0.05 | 8.73 | -20.13 | 14.09 | 0.00 | 1.31e+06 | 0.00 | -0.35 | 0.73 | 0.46 | |
rho_ | Intercept | -0.59 | 0.56 | 0.09 | -0.77 | -0.41 | 0.46 | 0.67 | 0.00 | -6.33 | <0.005 | 31.93 |
Concordance | 0.65 |
AIC | 986.34 |
log-likelihood ratio test | 18.87 on 3 df |
-log2(p) of ll-ratio test | 11.75 |
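In an AFT model, exp(coef) is a time ratio (acceleration factor) rather than a hazard ratio: a positive coefficient stretches survival times. For surgery in the Weibull fit above:

```python
import math

coef_surgery = 1.97  # lambda_ coefficient for C(surgery)[1] above
print(round(math.exp(coef_surgery), 2))  # 7.17: survival times multiplied by ~7
```

Note the sign flip relative to the Cox model: longer survival times (AFT coefficient > 0) correspond to a lower hazard (Cox coefficient < 0).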
Log-logistic

```python
aftl = LogLogisticAFTFitter()
aftl.fit(YX, duration_col='stime', event_col='died')
aftl.print_summary()
```
model | lifelines.LogLogisticAFTFitter |
duration col | 'stime' |
event col | 'died' |
number of observations | 103 |
number of events observed | 75 |
log-likelihood | -482.58 |
time fit was run | 2023-08-21 11:46:51 UTC |
coef | exp(coef) | se(coef) | coef lower 95% | coef upper 95% | exp(coef) lower 95% | exp(coef) upper 95% | cmp to | z | p | -log2(p) | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
alpha_ | C(surgery)[1] | 2.27 | 9.70 | 0.69 | 0.92 | 3.63 | 2.50 | 37.56 | 0.00 | 3.29 | <0.005 | 9.95 |
age | -0.04 | 0.96 | 0.02 | -0.08 | -0.00 | 0.92 | 1.00 | 0.00 | -2.01 | 0.04 | 4.48 | |
year | 0.24 | 1.27 | 0.12 | 0.01 | 0.47 | 1.01 | 1.60 | 0.00 | 2.06 | 0.04 | 4.66 | |
Intercept | -10.43 | 0.00 | 8.34 | -26.77 | 5.92 | 0.00 | 372.19 | 0.00 | -1.25 | 0.21 | 2.24 | |
beta_ | Intercept | -0.18 | 0.84 | 0.10 | -0.37 | 0.01 | 0.69 | 1.01 | 0.00 | -1.86 | 0.06 | 3.99 |
Concordance | 0.66 |
AIC | 975.16 |
log-likelihood ratio test | 21.69 on 3 df |
-log2(p) of ll-ratio test | 13.69 |
19 Package statsmodels

https://www.statsmodels.org/dev/duration.html

The package can estimate Kaplan-Meier type survival functions and Cox models.

19.1 Kaplan-Meier

```python
km = sm.SurvfuncRight(trans["stime"], trans["died"])
km.summary()
```
Surv prob | Surv prob SE | num at risk | num events | |
---|---|---|---|---|
Time | ||||
1 | 0.990291 | 0.009661 | 103 | 1.0 |
2 | 0.961165 | 0.019037 | 102 | 3.0 |
3 | 0.932039 | 0.024799 | 99 | 3.0 |
5 | 0.912621 | 0.027825 | 96 | 2.0 |
6 | 0.893204 | 0.030432 | 94 | 2.0 |
... | ... | ... | ... | ... |
852 | 0.250655 | 0.048731 | 14 | 1.0 |
979 | 0.227868 | 0.049341 | 11 | 1.0 |
995 | 0.205081 | 0.049390 | 10 | 1.0 |
1032 | 0.182295 | 0.048877 | 9 | 1.0 |
1386 | 0.151912 | 0.049277 | 6 | 1.0 |
62 rows × 4 columns
Log-rank tests are available through the survdiff function (same name as in R). On the graphics side, the programming seems rather heavy and would deserve simplification, so it is not covered here.

Comparing S(t) with log-rank tests

Result: (test statistic, p-value)

Unweighted test

```python
sm.duration.survdiff(trans.stime, trans.died, trans.surgery)
```
(6.5900123232343875, 0.010255246157888975)
Breslow (Gehan-Breslow weighting)

```python
sm.duration.survdiff(trans.stime, trans.died, trans.surgery, weight_type='gb')
```
(8.989753779902493, 0.0027149757927903417)
Tarone-Ware weighting

```python
sm.duration.survdiff(trans.stime, trans.died, trans.surgery, weight_type='tw')
```
(8.462352726451392, 0.0036257256194570653)
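survdiff returns the chi-squared statistic and its p-value. With 1 degree of freedom the p-value can be recovered by hand, since the chi-squared(1) survival function is erfc(sqrt(x/2)). A sketch using only the standard library:

```python
import math

def chi2_sf_1df(x):
    """P(X > x) for X ~ chi-squared with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

print(chi2_sf_1df(6.59))  # ~0.0103, matching the unweighted test above
```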
19.2 Cox model

```python
mod = smf.phreg("stime ~ year + age + surgery", trans, status='died', ties="efron")
rslt = mod.fit()
print(rslt.summary())
```
Results: PHReg
=============================================================
Model: PH Reg Sample size: 103
Dependent variable: stime Num. events: 75
Ties: Efron
-------------------------------------------------------------
log HR log HR SE HR t P>|t| [0.025 0.975]
-------------------------------------------------------------
year -0.1196 0.0673 0.8872 -1.7765 0.0757 0.7775 1.0124
age 0.0296 0.0135 1.0300 2.1872 0.0287 1.0031 1.0577
surgery -0.9873 0.4363 0.3726 -2.2632 0.0236 0.1584 0.8761
=============================================================
Confidence intervals are for the hazard ratios
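The statsmodels table reports confidence intervals on the hazard-ratio scale; they can be reproduced from the log HR and its standard error (surgery row above; the last digit may differ slightly because the inputs are rounded):

```python
import math

log_hr, se = -0.9873, 0.4363  # surgery: log HR and its SE from the PHReg output
hr = math.exp(log_hr)
ci = (math.exp(log_hr - 1.96 * se), math.exp(log_hr + 1.96 * se))
print(round(hr, 4), round(ci[0], 4), round(ci[1], 4))  # ~ 0.3726 0.1584 0.8762
```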