Statsmodels where columns have spaces

On the simple and multiple regression page, we introduced the Statsmodels library.

As you may remember from that page, we were looking for the best least-squares straight-line fit between two columns of values in patients with chronic kidney disease:

import pandas as pd

ckd = pd.read_csv('ckd.csv')
ckd.head()

	Age	Blood Pressure	Specific Gravity	Albumin	Sugar	Red Blood Cells	Pus Cell	Pus Cell clumps	Bacteria	Blood Glucose Random	...	Packed Cell Volume	White Blood Cell Count	Red Blood Cell Count	Hypertension	Diabetes Mellitus	Coronary Artery Disease	Appetite	Pedal Edema	Anemia	Class
0	48	70	1.005	4	0	normal	abnormal	present	notpresent	117	...	32	6700	3.9	yes	no	no	poor	yes	yes	1
1	53	90	1.020	2	0	abnormal	abnormal	present	notpresent	70	...	29	12100	3.7	yes	yes	no	poor	no	yes	1
2	63	70	1.010	3	0	abnormal	abnormal	present	notpresent	380	...	32	4500	3.8	yes	yes	no	poor	yes	no	1
3	68	80	1.010	3	2	normal	abnormal	present	present	157	...	16	11000	2.6	yes	yes	yes	poor	yes	no	1
4	61	80	1.015	2	0	abnormal	abnormal	notpresent	notpresent	173	...	24	9200	3.2	yes	yes	yes	poor	yes	yes	1

5 rows × 25 columns

In particular, we wanted to predict the blood concentration of creatinine (a marker of kidney failure) from the blood concentration of urea (another marker of kidney failure), in patients with chronic kidney disease.

Notice that the creatinine and urea columns are named "Serum Creatinine" and "Blood Urea" respectively. Both column names contain spaces.

We first select our columns of interest, and restrict ourselves to rows for patients with chronic kidney disease (CKD):

# Data frame restricted to kidney patients and columns of interest.
ckdp = ckd.loc[
    ckd['Class'] == 1,   # rows for CKD
    ['Serum Creatinine', 'Blood Urea']]  # columns of interest
ckdp.head()

	Serum Creatinine	Blood Urea
0	3.8	56
1	7.2	107
2	2.7	60
3	4.1	90
4	3.9	148

ckdp.plot.scatter('Blood Urea', 'Serum Creatinine')

<Axes: xlabel='Blood Urea', ylabel='Serum Creatinine'>

[Figure: scatter plot of Serum Creatinine (y-axis) against Blood Urea (x-axis)]

On the simple and multiple regression page, we started by renaming the columns to names without spaces:

ckdp_renamed = ckdp.copy()
# Rename the columns to names without spaces.
ckdp_renamed.columns = ['Creatinine', 'Urea']
ckdp_renamed.head()

	Creatinine	Urea
0	3.8	56
1	7.2	107
2	2.7	60
3	4.1	90
4	3.9	148
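With only two columns, typing the new names by hand is easy. For a data frame with many columns, a small helper can generate names without spaces. The following is a minimal sketch in plain Python; the function name `to_variable_name` is our own invention for illustration, not part of Pandas or Statsmodels.

```python
import re


def to_variable_name(column_name):
    """Replace each run of non-word characters (such as spaces) with '_'."""
    cleaned = re.sub(r'\W+', '_', column_name)
    # Variable names cannot start with a digit, so add a leading underscore.
    if cleaned and cleaned[0].isdigit():
        cleaned = '_' + cleaned
    return cleaned


original_names = ['Serum Creatinine', 'Blood Urea']
new_names = [to_variable_name(name) for name in original_names]
print(new_names)
```

This prints `['Serum_Creatinine', 'Blood_Urea']`. Pandas' `rename` method accepts a function for its `columns` argument, so something like `ckdp.rename(columns=to_variable_name)` should apply the same renaming to a whole data frame.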

We then ran a simple regression model in Statsmodels, to find the least-squares straight line.

import statsmodels.formula.api as smf

simple_model = smf.ols(formula="Creatinine ~ Urea", data=ckdp_renamed)
simple_fit = simple_model.fit()
simple_fit.summary()

OLS Regression Results
Dep. Variable:	Creatinine	R-squared:	0.716
Model:	OLS	Adj. R-squared:	0.709
Method:	Least Squares	F-statistic:	103.1
Date:	Wed, 05 Jun 2024	Prob (F-statistic):	9.33e-13
Time:	16:06:01	Log-Likelihood:	-95.343
No. Observations:	43	AIC:	194.7
Df Residuals:	41	BIC:	198.2
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	-0.0849	0.668	-0.127	0.899	-1.434	1.264
Urea	0.0552	0.005	10.155	0.000	0.044	0.066

Omnibus:	2.409	Durbin-Watson:	1.303
Prob(Omnibus):	0.300	Jarque-Bera (JB):	1.814
Skew:	0.503	Prob(JB):	0.404
Kurtosis:	3.043	Cond. No.	236.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
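The `coef` column in the summary gives the intercept and slope of the least-squares line. As a small illustration of what those two numbers mean, the sketch below plugs a urea value into the line by hand. It uses the rounded values printed in the summary, so the result is approximate; the function name `predict_creatinine` is ours, for illustration only.

```python
# Coefficients as reported (rounded) in the summary table above.
intercept = -0.0849
slope = 0.0552


def predict_creatinine(urea):
    """Predicted creatinine for a given urea value, using the fitted line."""
    return intercept + slope * urea


# The first patient in the table has a urea value of 56.
print(round(predict_creatinine(56), 4))
```

This gives a predicted creatinine of about 3.0063 for the first patient, whose measured value was 3.8.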

If we want to use the original column names instead, we have to do some extra work to make Statsmodels accept column names with spaces. The same applies to column names containing special characters, which, like spaces, make the names invalid as Python variable names.

For example, let’s say we were using the DataFrame ckdp with the original column names. We could try this:

# This generates an error, because the Statsmodels formula interface
# needs column names that work as variable names.
another_model = smf.ols(formula="Serum Creatinine ~ Blood Urea",
                        data=ckdp)

Traceback (most recent call last):

  File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/IPython/core/interactiveshell.py:3577 in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  Cell In[7], line 3
    another_model = smf.ols(formula="Serum Creatinine ~ Blood Urea",

  File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/statsmodels/base/model.py:203 in from_formula
    tmp = handle_formula_data(data, None, formula, depth=eval_env,

  File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/statsmodels/formula/formulatools.py:63 in handle_formula_data
    result = dmatrices(formula, Y, depth, return_type='dataframe',

  File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/patsy/highlevel.py:309 in dmatrices
    (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,

  File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/patsy/highlevel.py:164 in _do_highlevel_design
    design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,

  File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/patsy/highlevel.py:66 in _try_incr_builders
    return design_matrix_builders([formula_like.lhs_termlist,

  File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/patsy/build.py:689 in design_matrix_builders
    factor_states = _factors_memorize(all_factors, data_iter_maker, eval_env)

  File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/patsy/build.py:354 in _factors_memorize
    which_pass = factor.memorize_passes_needed(state, eval_env)

  File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/patsy/eval.py:478 in memorize_passes_needed
    subset_names = [name for name in ast_names(self.code)

  File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/patsy/eval.py:478 in <listcomp>
    subset_names = [name for name in ast_names(self.code)

  File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/patsy/eval.py:109 in ast_names
    for node in ast.walk(ast.parse(code)):

  File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/ast.py:50 in parse
    return compile(source, filename, mode, flags,

  File <unknown>:1
    Serum Creatinine
          ^
SyntaxError: invalid syntax
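The last lines of the traceback show the text `Serum Creatinine` being passed to Python's `ast.parse`, which fails because two words separated by a space are not valid Python. As a rough sketch of that step in isolation, we can call `ast.parse` ourselves. The helper name `parses_as_python` is our own, and is not part of Patsy or Statsmodels.

```python
import ast


def parses_as_python(code):
    """Return True if `code` is valid Python source, False otherwise."""
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True


# A name without spaces parses as an ordinary Python variable name.
print(parses_as_python("Creatinine"))
# Two words separated by a space are not valid Python.
print(parses_as_python("Serum Creatinine"))
```

The first call prints `True` and the second prints `False`, matching the `SyntaxError` at the end of the traceback.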

The solution is to use the Q() (quote) function in your formula. Q() comes from Patsy, the formula library that appears in the traceback above. It tells Statsmodels to treat the words ‘Serum’ and ‘Creatinine’ as one thing: ‘Serum Creatinine’, the name of the column.

another_model = smf.ols(formula="Q('Serum Creatinine') ~ Q('Blood Urea')",
                        data=ckdp)
another_fit = another_model.fit()
another_fit.summary()

OLS Regression Results
Dep. Variable:	Q('Serum Creatinine')	R-squared:	0.716
Model:	OLS	Adj. R-squared:	0.709
Method:	Least Squares	F-statistic:	103.1
Date:	Wed, 05 Jun 2024	Prob (F-statistic):	9.33e-13
Time:	16:06:01	Log-Likelihood:	-95.343
No. Observations:	43	AIC:	194.7
Df Residuals:	41	BIC:	198.2
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	-0.0849	0.668	-0.127	0.899	-1.434	1.264
Q('Blood Urea')	0.0552	0.005	10.155	0.000	0.044	0.066

Omnibus:	2.409	Durbin-Watson:	1.303
Prob(Omnibus):	0.300	Jarque-Bera (JB):	1.814
Skew:	0.503	Prob(JB):	0.404
Kurtosis:	3.043	Cond. No.	236.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
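Typing `Q('...')` around every name gets tedious when there are many columns. One option, sketched here under our own assumptions, is to build the formula string from the column names, adding `Q()` only where a name would not work as a Python variable name. The helpers `quote_term` and `build_formula` are hypothetical names for this illustration; they are not part of Statsmodels or Patsy.

```python
import keyword


def quote_term(name):
    """Return `name` unchanged if it works as a variable name, else wrap it in Q()."""
    if name.isidentifier() and not keyword.iskeyword(name):
        return name
    # repr() adds the quotes (and any escaping) around the column name.
    return f"Q({name!r})"


def build_formula(outcome, predictors):
    """Build a formula string such as "y ~ x1 + x2" from column names."""
    right_hand_side = " + ".join(quote_term(p) for p in predictors)
    return f"{quote_term(outcome)} ~ {right_hand_side}"


formula = build_formula('Serum Creatinine', ['Blood Urea'])
print(formula)
```

This prints `Q('Serum Creatinine') ~ Q('Blood Urea')`, the same formula string we typed by hand above, so it could be passed to `smf.ols` as the `formula` argument. Names that are already valid, such as `Creatinine` and `Urea`, come through unchanged.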