Design of Experiments (DOE): The Full Factorial Tutorial
The Full Factorial design is foundational to the theory of Design of Experiments (DOE). Although it is not the most efficient in terms of time or resources, understanding this design is essential before moving on to more advanced and economical techniques. A full factorial experiment tests every possible combination of factor levels, which allows all main effects and interactions to be estimated directly. In practice, however, full factorial designs can quickly become costly and labor intensive as the number of factors and levels increases. As a result, this design is only well suited to smaller studies.
In this tutorial, we will walk through a complete example of a full factorial DOE experiment. Using freely available Python libraries, we will generate the design matrix, explore the data, build a regression model, and interpret the results.
- Overview
- Full Factorial Tutorial
The Python Setup
Before we begin, let’s make sure we have all of the tools that we need. To set up a Python environment capable of running the code in this tutorial, use the following YAML setup file for conda. If you are not familiar with conda, read about setting up reproducible environments in a previous note.
name: doe
channels:
- defaults
- conda-forge
dependencies:
- python=3.10
- pip
- anaconda
- pydoe3
Once the environment is created, let’s start our Python script by importing all of the packages that we’ll need:
import pyDOE3 as doe
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.transforms import offset_copy
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.gofplots import qqplot
from statsmodels.stats.anova import anova_lm
import itertools
Here is what each package is used for:
- pyDOE3 - generates DOE designs
- numpy - performs efficient numeric calculations
- pandas - helps to wrangle tabular data
- matplotlib and seaborn - create plots and statistical visualizations
- scipy - offers a variety of scientific tools. Specifically, the stats sub-module offers statistical functions, including probability distributions and hypothesis tests
- statsmodels - does a lot of the heavy lifting in creating our statistical model
- itertools - helps to efficiently generate combinations of our factors and levels
The Example Scenario
Let’s imagine an example scenario. A chemist has developed a new reaction process to synthesize a specialty compound. The process worked well at small scale in the lab, but the initial scaled-up production batches have shown poor and inconsistent reaction yields. Instead of adjusting one variable at a time, the chemist decides to use a DOE approach to study how process parameters influence the final reaction yields. Wisely, he decides to follow the DOE steps from our overview note and perform a screening study to uncover which factors govern the yields of his chemical process.
Step 1: Define the Objectives
Carefully considering his current needs, the chemist defines the following objectives for his screening DOE study:
- Identify which process parameters significantly affect the reaction yield.
- Understand if any of the process parameters interact with one another.
Step 2: Identify Factors and Levels
Based on his experience from small-scale development experiments, the chemist believes that five process parameters, or experimental factors, may be responsible for yield fluctuations: precursor concentration, reaction temperature, solution pH, mixing speed, and reaction time.
Since this is a screening study, he decides to test each factor at two levels, i.e., to capture the low and high settings for each of the process parameters. Again, relying on his past knowledge of the process, he settles on the following values:
| Factor | Low Setting | High Setting |
|---|---|---|
| Concentration | 20 µM | 120 µM |
| Temperature | 5 °C | 20 °C |
| pH | 5 | 7 |
| Mixing | 10 rpm | 50 rpm |
| Time | 3 hours | 12 hours |
Let’s record our design parameters in Python:
factors = {'concentration': [20, 120],
'temperature': [5, 20],
'pH': [7, 5],
'mixing': [10, 50],
'time': [3, 12]}
units = {'concentration': 'µM',
'temperature': '°C',
'pH': '',
'mixing': 'rpm',
'time': 'h',
'result': 'mg'}
For convenience, let’s create some helper variables and make sure the data is sorted properly.
main_effects = list(factors.keys())
number_of_levels = []
for factor in main_effects:
# make sure the levels are sorted
factors[factor] = sorted(factors[factor])
# count levels for each factor
number_of_levels.append(len(factors[factor]))
Step 3: Select a Design
The chemist suspects that the poor yields of his process are caused not only by the direct influence of each process parameter, but also through their hidden interactions. Guided by this, he selects a full factorial design to comprehensively test the complete set of combinations of the factor levels. This approach will allow the chemist to quantify both the main effects of each factor and the interaction effects between factors.
The number of experiments required for a full factorial design increases exponentially, following the relation \( \text{Runs} = \text{Levels}^{\text{Factors}} \). Because of this rapid increase, most practical full factorial experiments use only two levels per factor. In our case, the chemist needs to test \(2^5=32\) conditions, and this is before any replication!
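As a quick sanity check, we can compute this number directly from the number_of_levels list we built earlier:
# total runs in a full factorial = product of the number of levels per factor
runs = int(np.prod(number_of_levels))
print(runs)  # 2**5 = 32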
To create the full factorial design table, we will use the pyDOE3 package.
design = doe.fullfact(number_of_levels).astype('int')
If we look inside the design variable at this point, we will see the following numpy array.
array([[0, 0, 0, 0, 0],
[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0],
[1, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 1, 0, 0],
[1, 1, 1, 0, 0],
[0, 0, 0, 1, 0],
[1, 0, 0, 1, 0],
[0, 1, 0, 1, 0],
[1, 1, 0, 1, 0],
[0, 0, 1, 1, 0],
[1, 0, 1, 1, 0],
[0, 1, 1, 1, 0],
[1, 1, 1, 1, 0],
[0, 0, 0, 0, 1],
[1, 0, 0, 0, 1],
[0, 1, 0, 0, 1],
[1, 1, 0, 0, 1],
[0, 0, 1, 0, 1],
[1, 0, 1, 0, 1],
[0, 1, 1, 0, 1],
[1, 1, 1, 0, 1],
[0, 0, 0, 1, 1],
[1, 0, 0, 1, 1],
[0, 1, 0, 1, 1],
[1, 1, 0, 1, 1],
[0, 0, 1, 1, 1],
[1, 0, 1, 1, 1],
[0, 1, 1, 1, 1],
[1, 1, 1, 1, 1]])
The generated full factorial array represents all 32 combinations of low and high levels (coded as 0, 1) for each of our five factors. To make it a little easier to understand and process, let’s convert the coded levels to actual values, and make it into a pretty pandas table.
df = pd.DataFrame(design, columns=main_effects)
# convert the coded values to string, to make
# replacement with actual numerical values
# less prone to code/value confusion
df = df.astype(str)
for factor in factors:
    # replace codes with actual values
    for i, level in enumerate(factors[factor]):
        df[factor] = df[factor].replace(str(i), float(level))
    # convert the column back to floats once all codes are replaced
    df[factor] = df[factor].astype('float')
Which gives us a nice pandas dataframe:
| concentration | temperature | pH | mixing | time |
|---|---|---|---|---|
| 20 | 5 | 5 | 10 | 3 |
| 120 | 5 | 5 | 10 | 3 |
| 20 | 20 | 5 | 10 | 3 |
| 120 | 20 | 5 | 10 | 3 |
| 20 | 5 | 7 | 10 | 3 |
| 120 | 5 | 7 | 10 | 3 |
| 20 | 20 | 7 | 10 | 3 |
| 120 | 20 | 7 | 10 | 3 |
| 20 | 5 | 5 | 50 | 3 |
| 120 | 5 | 5 | 50 | 3 |
| 20 | 20 | 5 | 50 | 3 |
| 120 | 20 | 5 | 50 | 3 |
| 20 | 5 | 7 | 50 | 3 |
| 120 | 5 | 7 | 50 | 3 |
| 20 | 20 | 7 | 50 | 3 |
| 120 | 20 | 7 | 50 | 3 |
| 20 | 5 | 5 | 10 | 12 |
| 120 | 5 | 5 | 10 | 12 |
| 20 | 20 | 5 | 10 | 12 |
| 120 | 20 | 5 | 10 | 12 |
| 20 | 5 | 7 | 10 | 12 |
| 120 | 5 | 7 | 10 | 12 |
| 20 | 20 | 7 | 10 | 12 |
| 120 | 20 | 7 | 10 | 12 |
| 20 | 5 | 5 | 50 | 12 |
| 120 | 5 | 5 | 50 | 12 |
| 20 | 20 | 5 | 50 | 12 |
| 120 | 20 | 5 | 50 | 12 |
| 20 | 5 | 7 | 50 | 12 |
| 120 | 5 | 7 | 50 | 12 |
| 20 | 20 | 7 | 50 | 12 |
| 120 | 20 | 7 | 50 | 12 |
We now have our design matrix! But before we begin our experiments, we want to add replication and randomize the order of tests:
- Replication is the planned repetition of experimental runs under the same conditions, used to estimate the natural variability of the process and to improve the precision of our statistical analysis. By including replicates, we can separate the variation caused by measurement noise or uncontrolled factors from the variation caused by the factors we are studying. The chemist decides to perform each experiment in duplicate, increasing the total number of tests to 64.
- Randomization is equally important because it helps to protect the study from hidden biases. If all runs were performed in a fixed sequence, external influences such as ambient temperature drift, equipment wear, or operator fatigue could create systematic errors. Randomizing the run order ensures that these potential disturbances are spread evenly across all treatment combinations, allowing for a more reliable and unbiased estimate of the true factor effects. Proper “blocking” and randomization of experiments is an advanced topic that we will revisit in a future note. For now, we will just shuffle all of our conditions in the dataframe.
replicate_number = 2
# copy and append the design matrix to itself
# to add a desired number of replicates
replicate_df = pd.DataFrame()
for i in range(0, replicate_number):
replicate_df = pd.concat([replicate_df, df])
df = replicate_df.reset_index(drop=True)
# Randomize the order
df = df.sample(frac=1, random_state=88).reset_index(drop=True)
Note that during the randomization we seeded our random generator with the number 88. This is done for the sake of reproducibility of this tutorial, but is not needed in practice.
Step 4: Conduct the Experiments
Now comes the hard part. The chemist needs to go into the production facility and run the 64 experiments he planned. Each experiment produces a result (the yield of the reaction in mg of the compound) which is recorded as the response variable. Once the results are obtained, we import them into our dataframe:
df['result'] = [3.8, 8.9, 1.5, 10.5, 8.0,
8.2, 3.7, 2.2, 3.8, 2.8,
8.4, 9.2, 5.0, 5.8, 4.5,
5.2, 6.5, 10.3, 8.9, 1.4,
9.6, 10.7, 4.7, 10.1, 2.8,
6.9, 8.3, 6.7, 1.1, 2.4,
3.1, 8.8, 5.0, 6.3, 4.2,
7.9, 2.2, 7.2, 9.7, 10.2,
1.8, 3.8, 1.7, 7.1, 9.1,
11.1, 4.0, 7.0, 2.8, 9.4,
5.3, 2.2, 6.6, 4.7, 8.0,
3.4, 7.2, 8.2, 3.4, 4.3,
8.6, 9.3, 2.5, 5.8]
Step 5: Analyze the Data
Exploration
Let’s explore the data to understand the variance and the distribution of the response values. Before we begin, we should note that a factorial design intentionally pushes factors to widely spaced settings, which can make the pooled response distribution multimodal, so we should not expect a perfectly normal distribution of values in the result dataset. However, a roughly normal response distribution suggests that the experimental system is stable and free of strong skew or outliers, while significantly non-normal data may point to measurement issues, nonlinear effects, or problems with experimental blocking. Detecting these patterns early helps to determine whether any additional data transformations are needed before we start building our statistical model.
Shapiro-Wilk Test
We will start with the Shapiro–Wilk test, which compares the ordered sample values to the values expected from a perfectly normal distribution. The test’s null hypothesis is that the data are normally distributed. A p-value greater than 0.05 means we cannot reject this hypothesis, suggesting that the data do not deviate significantly from normality. Conversely, a p-value below 0.05 indicates that the results are not normally distributed.
print(f"Shapiro–Wilk p-value: {stats.shapiro(df['result']).pvalue:.4f}")
This code produces “Shapiro–Wilk p-value: 0.0107”, indicating that our data is not normally distributed, as expected.
Box Plot
Let’s visualize the distribution of our results with a simple box plot.
fig, ax = plt.subplots()
sns.boxplot(data=df, y='result', ax=ax)
sns.stripplot(data=df, y='result', ax=ax, jitter=True, alpha=0.6, c='k')
plt.show()
The plot shows that the results are spread evenly. All quartiles are approximately the same size and are symmetric around the median value, i.e., there is no skew. There are also no significant outliers. So far, there is no clear explanation for why the distribution of the data is not normal.
Distribution Histogram
Let’s investigate the modality of our data using a histogram plot. If our data is normally distributed, it should have a single peak with symmetric tails.
g = df[['result']].plot(kind='hist')
g.set_xlabel('Result Bins')
It seems our results are bimodal, as the histogram shows two peaks. Multimodality of data in DOE occurs when the response variable changes systematically across the levels of experimental factors, producing clusters of results rather than a single smooth distribution. In other words, the two peaks may suggest that one of the factors in our design dominates all others.
Normal Q-Q Probability Plot
The histogram showed us that our data has systematic groupings, but how far from a normal distribution does it actually stray? For this, let’s look at a normal quantile-quantile (Q-Q) probability plot. This plot compares the quantiles of our results against the quantiles of an ideal normal distribution, with a diagonal reference line indicating perfect normality.
qqplot(df['result'], line='s')
plt.title('Normal Q–Q plot of Response Variable')
plt.show()
The slight “zigzag” pattern of the data points around the diagonal reflects the two peaks observed in the histogram. There are also signs of minor tailing at both the low and high ends of the responses. However, the overall distribution of the response variable appears to follow normality quite closely.
Why does the histogram disagree with the Q-Q plot? Unlike the histogram, which highlights the local structure of the data, the Q–Q plot compares the cumulative distribution of the data to that of an ideal normal distribution, emphasizing the overall agreement with normality across the full range. In a sense, our systematic sampling of data around the two levels per factor may have created a sampling bias in our experiment, leading to the perceived multi-modality of the results. There is a good chance that if we sampled our design space more randomly or more completely, the histogram of the response would fill out to yield a more symmetric distribution.
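If you want to see this cumulative view directly, one option is to overlay the empirical CDF of the results on the CDF of a normal distribution fitted to the same mean and standard deviation. A minimal sketch using the tools we already imported:
# empirical CDF of the results vs. the CDF of a fitted normal distribution
sorted_results = np.sort(df['result'].to_numpy())
ecdf = np.arange(1, len(sorted_results) + 1) / len(sorted_results)
normal_cdf = stats.norm.cdf(sorted_results,
                            loc=sorted_results.mean(),
                            scale=sorted_results.std(ddof=1))
fig, ax = plt.subplots()
ax.step(sorted_results, ecdf, where='post', label='Empirical CDF')
ax.plot(sorted_results, normal_cdf, 'k--', label='Fitted normal CDF')
ax.set_xlabel(f"Result ({units['result']})")
ax.set_ylabel('Cumulative probability')
ax.legend()
plt.show()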
We conclude that our data is reasonably consistent with a normal distribution, and no data transformations are required at this stage.
Run Order Plot
Before we start looking into the effects of our experimental factors, let’s check for signs of uncontrolled effects. A run order plot shows the response values in the sequence that the experimental runs were performed. It helps to detect trends or patterns over time, such as instrument drift, temperature changes, or operator effects. Ideally, the points should appear randomly scattered, indicating that the process was stable and that randomization successfully minimized time-related bias.
fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(x=df.index, y=df.result)
ax.set_xlabel('Run Order')
ax.set_ylabel('Result')
plt.show()
The variability of our results seems to be consistent throughout the study. There are no upward or downward trends, and the vertical spread of the data does not seem to expand or narrow as the runs progress. It seems that our experimental environment did not have any significant uncontrolled factors.
Ranking Effects and Interactions
Now that we’ve confirmed that the results are (approximately) normally distributed and free from obvious biases, we can start exploring how each factor influences the reaction yield. Before we start building our statistical model, we will continue exploring the data graphically, to get a sense of the importance of each model parameter.
Main Effects Plot
Let’s plot the average response at each level of a single factor to quickly see which variables have the strongest influence on the result.
result_mean = df.result.mean()
level_means = {}
deltas = {}
for factor in main_effects:
level_means[factor] = (df[[factor, 'result']]
.groupby(factor)
.result.mean()
.reset_index()
.rename({factor: "level"}, axis=1))
deltas[factor] = (level_means[factor].result.max()
- level_means[factor].result.min())
main_effects = sorted(list(main_effects),
key=lambda x: deltas[x],
reverse=True)
ncols = len(level_means.keys())
fig, axes = plt.subplots(figsize=(12,6),
ncols=ncols,
nrows=1,
sharex=False,
sharey=True)
for i, ax in enumerate(axes.flat):
factor = main_effects[i]
dataset = level_means[factor]
delta = deltas[factor]
ax.plot(dataset['level'],
dataset['result'],
color='tab:blue',
marker='o',
linestyle='solid')
ax.set_title(f'{factor} (Δ={round(float(delta),1)})')
ax.set_xlabel(f'{factor} ({units[factor]})')
ax.set_ylabel(f'Mean Result ({units["result"]})')
ax.axhline(y=result_mean, c='k', ls='--')
plt.show()
Looking at the slopes of the lines, it seems that concentration has the strongest positive effect on the reaction yield. Mixing also affects the yield in a positive way, but to a much smaller degree. Lastly, temperature has a small but negative effect on the reaction yield. Time and pH seem to have little influence on the reaction yield.
This line-based presentation is common in DOE, because it provides a simple summary of the main effects. A slightly more informative way of plotting the main effects is with box plots separated by level.
fig, axes = plt.subplots(figsize=(9,15),
ncols=1,
nrows=len(main_effects),
sharex=False,
sharey=True)
for i, effect in enumerate(main_effects):
ax = axes.flat[i]
sns.boxplot(data=df, x=effect, y='result', ax=ax)
fig.tight_layout()
plt.show()
We see the same trends, but we now have a better sense of variability at each level. Concentration seems to be very important, as it is the only effect where the level-separated box plots are clearly differentiated from each other. This means that variability from concentration dominates the variability coming from all other factors. Mixing and temperature still show some effects, but they are much more obscured by variability coming from the other factors.
Interaction Effects Plot
One of the greatest advantages of DOE is its ability to reveal interaction effects, showing how factors influence each other rather than acting independently. The usual one-factor-at-a-time testing cannot provide such insights. Factor interactions occur when the effect of one factor on the response depends on the level of another factor. In other words, the influence of one variable is not constant but changes in combination with others. The simplest type is a two-factor interaction, where two factors jointly influence the response in a way that cannot be explained by their individual effects alone. For example, the effect of temperature on yield might be stronger at high mixing speed than at low mixing speed. More complex interactions can also occur. A three-factor interaction means that the two-factor interaction itself changes depending on the level of a third factor, and so on for higher orders. While higher-order interactions are mathematically possible, they are often difficult to interpret and rarely significant in practical experiments. As a result, in our tutorial we will focus on two-factor interactions only.
In DOE, it is common to visualize interaction effects using a grouped linear plot.
combinations = list(itertools.combinations(factors, 2))
fig, axes = plt.subplots(figsize=(9,5),
ncols=len(combinations),
nrows=1,
sharex=False, sharey=True)
for i, combination in enumerate(combinations):
ax = axes.flat[i]
ax.set_title(f"{combination[0]} : {combination[1]}")
sub = df.copy()
# normalize the result to "overlay" the
# lines on top of each other to make
# slope comparison easier
sub['normalized_average_result'] = (sub.result
/ sub.groupby(combination[1])
.result.transform('mean'))
sns.pointplot(data=sub, x=combination[0],
y='normalized_average_result',
hue=combination[1],
errorbar=None,
ax=ax)
plt.show()
In this plot, each panel corresponds to a pair of factors, with one factor on the x-axis and the other represented by color-coded lines. The y-axis shows the normalized average response, adjusted so that differences in scale between levels are easier to compare. These plots provide an intuitive visual way to identify which pairs of factors may warrant closer examination in the statistical model. If the lines in a plot are roughly parallel, it suggests that the two factors act independently, meaning there is little or no interaction between them. However, if the lines cross or diverge, it indicates that the effect of one factor depends on the level of the other, and an interaction is likely present. The greater the deviation from parallelism, the stronger the interaction effect.
In our example, most lines are fairly parallel, suggesting no interaction. However, for temperature : time the lines do diverge significantly, suggesting that interactions may be present.
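To put a rough number on that divergence before any modeling, we can compare the average effect of time at each temperature level directly from the data. A minimal sketch (the formal test comes from the statistical model below):
# mean yield at each temperature/time combination
cell_means = df.groupby(['temperature', 'time']).result.mean().unstack('time')
print(cell_means)
# effect of going from 3 h to 12 h at each temperature;
# if these two values differ noticeably, an interaction is likely
print(cell_means[12.0] - cell_means[3.0])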
Let’s summarize our exploration so far:
- Our data seems to follow a normal distribution quite closely.
- Concentration seems to have the strongest effect on the result.
- Mixing and temperature seem to have smaller, but meaningful effects.
- There may be an interaction between temperature and time.
Statistical Modeling
Exploratory plots help reveal patterns in the data, but modeling goes further by quantifying those relationships and testing their statistical significance. While visual inspection can suggest which factors might be important, a fitted model estimates how much each factor contributes to the response, identifies interactions, and separates real effects from random noise. Modeling also enables prediction of untested conditions, estimation of confidence intervals, and optimization of the response within the design space.
Modeling in DOE is an iterative process. The first fitted model rarely provides the final answer but serves as a guide for refinement. Early results reveal which factors and interactions are significant, highlight unexpected relationships, and sometimes uncover issues such as curvature or outliers. Based on these insights, the model can be improved by removing weak terms, adding higher-order effects, or adjusting the experimental design to explore new regions of the design space. This cycle of building, evaluating, and refining continues until the model reliably represents the process and supports accurate predictions.
The goal of model selection is to balance simplicity and accuracy. A model should capture the key effects and interactions without overfitting random variation. For a full factorial design, the process typically begins with a linear model that includes all main effects and two-factor interactions. Higher-order terms can then be added if the initial model does not explain enough of the observed variability.
Initial Model
Let’s use the formula API from statsmodels to build our linear equation:
model_equation = "result ~ "
for param in main_effects:
model_equation += f"{param} + "
for combination in combinations:
model_equation += f"{combination[0]}:{combination[1]} + "
model_equation = (model_equation[:-3]
if model_equation[-3:] == " + "
else model_equation)
print(model_equation)
Looking inside the model_equation variable we get the following linear equation (in Patsy notation):
result ~ concentration + mixing + temperature + time + pH
+ concentration:temperature + concentration:pH
+ concentration:mixing + concentration:time
+ temperature:pH + temperature:mixing
+ temperature:time + pH:mixing
+ pH:time + mixing:time
It should be noted that statsmodels includes an intercept term by default.
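If you ever need to drop the intercept, the Patsy formula syntax lets you remove it explicitly by appending “- 1” to the formula (shown only as a syntax example; we keep the default intercept in this tutorial):
# the same model without an intercept term (syntax illustration only)
no_intercept_model = smf.ols(model_equation + " - 1", data=df).fit()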
Now that we have the equation with all of the desired terms, let’s create the model using the Ordinary Least Squares (OLS) fitting approach.
model = smf.ols(model_equation, data=df).fit()
summary = model.summary2()
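To display the full set of summary tables in the console, we can simply print the summary object:
print(summary)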
The summary variable contains the descriptive statistics about our model. Let’s spend some time understanding it.
General Model Fit Statistics:
- Df Model (degrees of freedom for the model): number of estimated parameters that are not the intercept.
- Df Residuals (degrees of freedom for residuals): the number of independent pieces of information left to estimate the variability of the error term after the model’s parameters have been fitted.
- R-squared: fraction of variance in the response explained by the model.
- Adj. R-squared: R-squared adjusted for model size, useful for comparing models with different numbers of terms.
- AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion): information criteria that trade off fit and complexity. Lower values indicate a better model. We will use these values to judge how our model improves with refinement.
- Log-Likelihood: likelihood of the data under the fitted model. Larger is better, used by AIC and BIC.
- F-statistic and Prob (F-statistic): overall test that at least one coefficient (beyond the intercept) is nonzero.
- Scale: estimate of residual variance (often the mean squared error).
Model Coefficients:
- Coef.: estimated effect for each term, including the intercept which is included by default.
- Std.Err.: standard error of the estimate.
- t and P>|t|: t test and p-value for the null hypothesis that the coefficient equals zero.
- [0.025, 0.975]: 95 percent confidence interval for the coefficient.
Diagnostic Tests:
- Durbin–Watson: checks autocorrelation of residuals. Values near 2 suggest little autocorrelation.
- Omnibus and Prob(Omnibus): joint test of skewness and kurtosis of residuals.
- Jarque–Bera, Prob(JB), Skew, Kurtosis: normality-related diagnostics for residuals.
- Cond. No.: condition number of the design matrix. Large values suggest multicollinearity or scaling issues.
A few important take-aways from this summary:
- Adj. R-squared: is fairly high, suggesting that our model explains the vast majority of variance in the data. This is great!
- Prob (F-statistic) is very small, suggesting that at least one of the parameters in our model is significant.
- Prob(Omnibus) and Prob(JB) are both high, suggesting that the residuals of the model are normally distributed (i.e., random). This means that the model is fitting the data in a balanced and comprehensive way.
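These headline statistics are also available directly as attributes of the fitted results object, which will be handy when comparing models during refinement:
# pull the key fit statistics straight from the fitted OLS results
print(f"Adj. R-squared: {model.rsquared_adj:.3f}")
print(f"Prob (F-statistic): {model.f_pvalue:.2e}")
print(f"AIC: {model.aic:.1f}, BIC: {model.bic:.1f}")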
In DOE, another common way to evaluate the model is through an ANOVA table:
aov = anova_lm(model, typ=3)
# scientific notation is hard to read
aov['PR(>F)'] = aov['PR(>F)'].round(4)
This table summarizes how total variation in the response is split among model terms and residual error, listing degrees of freedom, sums of squares, mean squares, F-statistics, and p-values to test which terms explain significant variation.
We specifically select a table with “Type III” sums of squares. This makes it so that each term is adjusted for all other terms in the model, which is important when we want to consider term interactions.
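If you are curious how this choice affects the numbers, you can also compute the Type II table and place the two side by side. A quick sketch (the column suffixes are just labels chosen here):
# compare Type II and Type III ANOVA tables for the same model
aov2 = anova_lm(model, typ=2)
aov3 = anova_lm(model, typ=3)
comparison = aov2[['sum_sq', 'PR(>F)']].join(
    aov3[['sum_sq', 'PR(>F)']], lsuffix='_typ2', rsuffix='_typ3')
print(comparison.round(4))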
Looking at the ANOVA table, we see that a few of the terms have PR(>F) values below 0.05, suggesting that they are significant. These individual p-values are a good estimate of significance, but to draw conclusions we should perform more formal statistical tests that employ multiple comparison correction. Let’s use the Bonferroni correction method and plot a Pareto diagram to visualize our results (not to be confused with the 80-20 Pareto principle).
alpha=0.05
idx = pd.Index(model.params.index)
keep = ~idx.str.lower().isin(['intercept','const'])
names = idx[keep]
tvals = model.tvalues[names]
abs_t = tvals.abs().sort_values()
df_e = int(model.df_resid)
t_alpha = stats.t.ppf(1 - alpha/2, df_e)
t_alpha_bonf = stats.t.ppf(1 - (alpha/len(names))/2, df_e)
fig, ax = plt.subplots(figsize=(8,6))
ax.barh(abs_t.index, abs_t.values)
ax.set_xlabel('Absolute standardized effect (|t|)')
ax.set_title('Effect (Pareto) Plot — Standardized Effects')
ax.axvline(t_alpha, linestyle='--', color='k',
linewidth=1, label=f'α={alpha:.2g}')
if t_alpha_bonf is not None:
ax.axvline(t_alpha_bonf, linestyle=':', color='k',
linewidth=1,
label=f'Bonferroni α/m={alpha/len(names):.2g}')
ax.legend(loc='upper right')
ax.invert_yaxis()
plt.tight_layout()
plt.show()
As we saw in data exploration, concentration dominates all of the main effects. Mixing also seems to have a strong effect, while time shows a small effect that does not make it past our adjusted significance threshold. What is most surprising about these results is that the temperature:time interaction seems to be very strong, with an effect size that is second only to concentration. What is more interesting is that the time coefficient is positive, while that of temperature:time is negative, suggesting that the full effect of time may have been somewhat obscured by the interaction. The presence of such a strong interaction is what usually makes the behavior of a system hard to predict, and this is exactly where DOE shines!
While we are exploring our fit, let’s also look at partial regression plots to confirm the findings from the Pareto plot.
fig = plt.figure(figsize=(9, 18))
g = sm.graphics.plot_partregress_grid(model, fig=fig)
These partial regression plots (sometimes called “added variable plots”) show the unique relationship between the response and one predictor after removing the linear effects of all the other predictors. On the x-axis you see the residuals of that predictor after regressing it on the remaining predictors, and on the y-axis you see the residuals of the response after regressing it on those same remaining predictors. If the points line up, that predictor contributes linearly to the model. The fitted line in the plot has a slope equal to the predictor’s coefficient in the full model, and its tightness reflects the partial correlation given the other variables.
Indeed, keeping in mind the spread of the values on the y-axis, we clearly see that the line fits for concentration and mixing show predictive capacity, while the lines for temperature and pH just cut horizontally through the noise. As suggested by the Pareto plot, time is really borderline. From the interactions group, temperature:time shows clear separation between the levels, confirming that this unexpected interaction is real.
Assessing the Model Fit
Before we make any conclusions, let’s make sure that our chosen model is appropriate for our experiment.
When assessing a model, the first place we start looking is the residuals. Let’s look at their distribution.
fitted = model.fittedvalues
resid = model.resid
fig, ax = plt.subplots()
ax.scatter(fitted, resid, alpha=0.8)
ax.axhline(0, color='k', linewidth=1)
ax.set_xlabel('Fitted values')
ax.set_ylabel('Residuals')
ax.set_title('Residuals vs Fitted')
plt.show()
The residuals are randomly scattered across the range of fitted values with a roughly constant spread, indicating homoscedasticity and no obvious pattern. This suggests the model captures the signal without systematic bias across the predictor space. In short, no region appears over- or underfit, and the remaining variation looks like noise rather than missed structure.
We can assess it a little more formally with a Q-Q Plot and a Shapiro-Wilk test:
print("Shapiro–Wilk p-value:",
stats.shapiro(resid).pvalue.round(4))
qqplot(resid, line='s')
plt.title("Q–Q plot of residuals")
plt.show()
Shapiro–Wilk p-value: 0.2007
Both the Shapiro-Wilk test and the Q-Q plot confirm that the residuals are normally distributed. This, combined with the adjusted \( R^{2} = 0.991 \), tells us that the model fits our data quite well.
Refining the Model
Despite the fact that our model seems to explain the data well, it contains a lot of insignificant terms. Their presence may be diluting the significance of other factors. In fact, the borderline significance of time may improve once we prune out the insignificant terms. Let’s re-evaluate a model that keeps only the terms that surpass our unadjusted alpha level. Before we start the re-evaluation, let’s package our model evaluation code in an object for better reusability.
from IPython.display import display, Markdown  # provides display() and Markdown() used in the report below

class evaluate_model(object):
def __init__(self, model_equation, alpha=0.05):
self.model_equation = model_equation
self.alpha = alpha
self.model = smf.ols(model_equation, data=df).fit()
self.summary_tables = self.model.summary2().tables
self.aov = self.summary_tables[1]
self.aov = pd.concat([self.aov,
anova_lm(self.model, typ=3)],
axis=1)
self.aov['PR(>F)'] = self.aov['PR(>F)'].round(4)
self.aov['mean_sq'] = self.aov['sum_sq'] / self.aov['df']
        # store the overall sums of squares for later reference
        y = df['result'].to_numpy()
        self.sst = ((y - y.mean())**2).sum()
        self.sse = self.aov.loc['Residual', 'sum_sq']
        self.dfe = self.aov.loc['Residual', 'df']
        self.mse = self.aov.loc['Residual', 'mean_sq']
        self.ss_model = self.sst - self.sse
def report(self):
display(Markdown("### Fit Summary"))
display(self.summary_tables[0])
display(Markdown("### ANOVA Table"))
display(self.aov)
display(Markdown("### Pareto Plot"))
model_params = pd.Index(self.model.params.index)
no_constant = (~model_params.str.lower()
.isin(['intercept','const']))
compared_params = model_params[no_constant]
tvals = self.model.tvalues[compared_params]
abs_t = tvals.abs().sort_values()
df_e = int(self.model.df_resid)
t_alpha = stats.t.ppf(1 - self.alpha/2, df_e)
t_alpha_bonf = stats.t.ppf(1 - (self.alpha
/len(compared_params))/2,
df_e)
self.significant_parameters = (abs_t.loc[abs_t >= t_alpha]
.index.to_list())
fig, ax = plt.subplots(figsize=(8,6))
ax.barh(abs_t.index, abs_t.values)
ax.set_xlabel('Absolute standardized effect (|t|)')
ax.set_title('Effect (Pareto) Plot — Standardized Effects')
ax.axvline(t_alpha, linestyle='--',
color='k', linewidth=1,
label=f'α={self.alpha:.2g}')
if t_alpha_bonf is not None:
ax.axvline(t_alpha_bonf,
linestyle=':', color='k',
linewidth=1,
label=f'Bonferroni α/m={self.alpha/len(compared_params):.2g}')
ax.legend(loc='upper right')
ax.invert_yaxis()
plt.tight_layout()
plt.show()
display(Markdown("### Partial Regression Plot"))
fig = plt.figure(figsize=(9, 18))
g = sm.graphics.plot_partregress_grid(self.model, fig=fig)
display(Markdown("### Residual Plot"))
fitted = self.model.fittedvalues
resid = self.model.resid
fig, ax = plt.subplots()
ax.scatter(fitted, resid)
ax.axhline(0, color='k', linewidth=1)
ax.set_xlabel('Fitted values')
ax.set_ylabel('Residuals')
ax.set_title('Residuals vs Fitted')
plt.show()
display(Markdown("### Residual Normality Tests"))
print("Shapiro–Wilk p-value for the Residuals:",
stats.shapiro(resid).pvalue.round(4))
qqplot(resid, line='s')
plt.title("Q–Q plot of residuals")
plt.show()
If we run the new object on the original (full) equation and generate its report, we can extract the significant terms and build a refined equation. When including interaction terms, we should always keep their parent terms as well, even if they are themselves not significant.
model1 = evaluate_model(model_equation)
model1.report()  # generating the report also populates significant_parameters
refined_model_equation = "result ~ "
for param in model1.significant_parameters:
refined_model_equation += f"{param} + "
refined_model_equation = (refined_model_equation[:-3]
if refined_model_equation[-3:] == " + "
else refined_model_equation)
print(refined_model_equation)
Which outputs:
result ~ time + mixing + temperature:time + concentration + temperature
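Here the parent terms temperature and time cleared the unadjusted threshold on their own, but it is good practice to verify model hierarchy programmatically when interactions are kept. A minimal check using the significant_parameters list from above:
# confirm that every interaction term kept in the refined model has its parent main effects
kept_terms = set(model1.significant_parameters)
required_parents = set()
for term in kept_terms:
    if ':' in term:
        required_parents.update(term.split(':'))
missing_parents = required_parents - kept_terms
print(missing_parents if missing_parents else "all parent terms already included")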
Now we can run our model evaluation code on the refined equation:
model2 = evaluate_model(refined_model_equation)
model2.report()
Which outputs the following information about the refined model:
[Report output: Pareto plot, partial regression plots, and residual plot omitted here.]
Residual Normality Tests: Shapiro–Wilk p-value for the Residuals: 0.405
We see that with fewer terms, the time parameter now exceeds our adjusted alpha level of significance. The adjusted \(R^{2}\) remains 0.991, signifying that the removed terms did not explain any of the variability in the results. Both the AIC and BIC went down, AIC from 30.1 to 17.6 and BIC from 64.7 to 30.5, signifying improvement in model quality. The updated model seems to check all of the boxes, so we can end the refinement process here.
Model Interpretation
Our statistical model has confirmed that precursor concentration, mixing speed, and reaction time have a significant impact on the reaction yield, while reaction temperature and solution pH do not. What is perhaps most surprising is that the interaction term between temperature and reaction time is significant. How can this be, and what exactly does it mean? How can temperature have virtually no effect at shorter times, but a measurable effect at longer times? After scratching his head about these results for a bit, our chemist realizes that one factor he did not directly control in this experiment is solvent evaporation. At low temperatures, the solvent barely evaporates at either long or short experiment durations, while at increased temperatures, the duration of the experiment significantly affects the extent of solvent evaporation. This, in turn, changes another uncontrolled parameter of the reaction: the ionic strength of the solution. With this insight, the chemist is now able to come up with an engineering solution (re-design the reactor to prevent evaporation), as well as a new experiment design that includes the ionic strength of the solution as one of the experimental factors.
This kind of insight would not be possible with the one-variable-at-a-time experimental approach. Being able to characterize factor interactions is one of the greatest strengths of DOE, and our chemist sure is thankful for being able to use this technique to solve his problem.
Process Optimization
Optimizing a process with a simple linear model is usually trivial: maximize parameters with a positive slope, and minimize the ones with a negative slope. But for the sake of the example, let’s present a simple corner-search algorithm using our refined model.
def bounds_from_factors(factors):
return {k: (min(v), max(v)) for k, v in factors.items()}
def corner_search(model, factors, alpha=0.05):
bnds = bounds_from_factors(factors)
names = list(bnds.keys())
corners = list(itertools.product(*[(lo, hi)
for (lo, hi) in bnds.values()]))
Xcorner = pd.DataFrame(corners, columns=names)
sf = model.get_prediction(Xcorner).summary_frame(alpha=alpha)
i_max = int(sf['mean'].idxmax())
i_min = int(sf['mean'].idxmin())
res = {
'maximize': {
'settings': Xcorner.iloc[i_max],
'pred': float(sf.loc[i_max, 'mean']),
'mean_CI': (float(sf.loc[i_max, 'mean_ci_lower']),
float(sf.loc[i_max, 'mean_ci_upper'])),
'PI': (float(sf.loc[i_max, 'obs_ci_lower']),
float(sf.loc[i_max, 'obs_ci_upper'])),
},
'minimize': {
'settings': Xcorner.iloc[i_min],
'pred': float(sf.loc[i_min, 'mean']),
'mean_CI': (float(sf.loc[i_min, 'mean_ci_lower']),
float(sf.loc[i_min, 'mean_ci_upper'])),
'PI': (float(sf.loc[i_min, 'obs_ci_lower']),
float(sf.loc[i_min, 'obs_ci_upper'])),
}
}
return res
corner_search(model2.model, factors, alpha=0.05)
Which produces:
{'maximize': {'settings': concentration 120
temperature 5
pH 5
mixing 50
time 12
Name: 19, dtype: int64,
'pred': 10.540625000000018,
'mean_CI': (10.37785691344415, 10.703393086555886),
'PI': (9.984666143597542, 11.096583856402495)},
'minimize': {'settings': concentration 20
temperature 20
pH 5
mixing 10
time 12
Name: 9, dtype: int64,
'pred': 1.4843749999999933,
'mean_CI': (1.321606913444126, 1.6471430865558607),
'PI': (0.9284161435975177, 2.040333856402469)}}
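The same prediction machinery works for any settings inside the design space, not just the corners. For example, to get the predicted yield, with confidence and prediction intervals, at a hypothetical intermediate operating point (the values below are chosen purely for illustration):
# predicted yield at an arbitrary (interpolated) operating point
new_point = pd.DataFrame({'concentration': [100],
                          'temperature': [10],
                          'pH': [6],
                          'mixing': [40],
                          'time': [8]})
prediction = model2.model.get_prediction(new_point).summary_frame(alpha=0.05)
print(prediction[['mean', 'mean_ci_lower', 'mean_ci_upper',
                  'obs_ci_lower', 'obs_ci_upper']])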
Wrap-up
In this tutorial we demonstrated how even a simple full factorial DOE study can uncover real interactions and interpret them in practical process terms. In future notes we will extend this workflow to more economical designs, introduce screening and blocking, get familiar with more formal lack-of-fit assessments and pure-error checks, discuss factor coding and hierarchical model selection, cover multi-response optimization with desirability functions, and touch on measurement system considerations and power planning for sequential experimentation.