5. Automation
This tutorial was adapted from the Automation and Make lesson by The Software Carpentries and reworked to match the project structure in gecs-make/.
Automation is what turns a one-off analysis into a repeatable workflow. In the gecs-make project, we already have a useful pipeline: take a book from books/, count the words into counts/, and turn those counts into an interactive plot in figures/.
The problem is not that the commands are hard to run. The problem is that they are still hard-coded around a single example, dracula.txt. As soon as we want to repeat the same workflow for six or seven books, the process becomes repetitive, error-prone, and annoying to maintain.
This is exactly the kind of problem make is good at solving.
The Book Project
If you do not have the example project yet, clone it from GitHub first:
git clone https://github.com/igorsdub/gecs-make
cd gecs-make
If you already cloned it earlier, move into the project directory:
cd gecs-make
This tutorial assumes you are running the commands below from inside that repository.
The gecs-make project has this structure:
books/ raw text files
counts/ generated word-count tables
figures/ generated HTML plots
scripts/ Python analysis scripts
pixi.toml project dependencies and tasks
The README already shows the manual workflow for one book:
python scripts/get_summary.py books/dracula.txt
python scripts/count_words.py books/dracula.txt counts/dracula.tsv
python scripts/plot_counts.py counts/dracula.tsv figures/dracula.html
The first command, get_summary.py, is useful for inspecting metadata, but it is not part of the main file-generation pipeline. The core workflow we want to automate is the one that produces files:
books/*.txt -> counts/*.tsv -> figures/*.html
There are also Pixi tasks for the same book:
pixi run summarize
pixi run count
pixi run plot
pixi run all
That is a nice start, but the tasks in pixi.toml are still hard coded to dracula.txt. make lets us describe the dependency structure once and then apply it to every book in the project.
Before going further, make sure make is available:
make --version
If that command fails, install make first:
- macOS: if you use Homebrew, run brew install make. Depending on your shell setup, you may then need to use gmake instead of make, because the GNU version from Homebrew is often installed under that name.
- WSL: run sudo apt update and then sudo apt install make. Installing build-essential is also common if you want the standard Unix build tools as a group.
- Pixi Global: if you already use Pixi, you can also install GNU Make with pixi global install make and then use it from your shell.
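If you would rather check for make from a script than by hand, a small Python sketch along these lines could do it. The helper name find_make is hypothetical; it only looks for make and then gmake on the PATH and does not check which version is installed.

```python
import shutil


def find_make():
    """Return the path of the first make-like program on PATH, or None."""
    # gmake is checked second because Homebrew often installs GNU Make under that name.
    for name in ("make", "gmake"):
        path = shutil.which(name)
        if path is not None:
            return path
    return None


print(find_make() or "make was not found on PATH")
```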
Makefile
make reads instructions from a file called Makefile. Create a new file named Makefile in the root of the gecs-make repository, next to books/, counts/, figures/, and scripts/. Then add the first rule below to that file.
If you want to use a different filename, for example workflow.mk, you can tell make which file to read with the -f option:
make -f workflow.mk counts/dracula.tsv
For the rest of this lesson, we will assume the file is named Makefile. Let us start by automating one real step from the project: generating a count table for Dracula.
counts/dracula.tsv : books/dracula.txt scripts/count_words.py
	python scripts/count_words.py books/dracula.txt counts/dracula.tsv
This rule has three important parts:
- Target: counts/dracula.tsv is the file we want to create.
- Dependencies: books/dracula.txt and scripts/count_words.py are the things needed to build the target.
- Recipe: the indented command tells make how to build the target.
The indentation matters. Recipe lines in a Makefile must begin with a TAB character, not spaces. If you indent a recipe with four spaces instead, make will usually fail with an error like:
Makefile:<line-number>: *** missing separator. Stop.
If that happens, check the start of the recipe line carefully and replace the spaces with a TAB character.
If we run:
make counts/dracula.tsv
then make checks whether counts/dracula.tsv exists and whether either dependency is newer. If the target is missing or out of date, it runs the recipe.
That gives us our first big improvement over a shell script: make knows what output file a command is supposed to create, and it can decide when the command actually needs to run.
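To make the up-to-date check concrete, here is a minimal Python sketch of the timestamp comparison described above. It is an illustration only, not how make is actually implemented, and the function name needs_rebuild is hypothetical.

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency.

    Hypothetical helper that mimics make's basic timestamp rule.
    """
    if not os.path.exists(target):
        return True  # a missing target always has to be built
    target_mtime = os.path.getmtime(target)
    # Rebuild when at least one dependency was modified after the target.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


if __name__ == "__main__":
    deps = ["books/dracula.txt", "scripts/count_words.py"]
    # Assumes the script is run from the root of the gecs-make repository.
    if all(os.path.exists(dep) for dep in deps):
        print(needs_rebuild("counts/dracula.tsv", deps))
```

Editing either the book or the script gives that file a newer modification time than the count table, which is why the sketch would report that a rebuild is needed.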
Another way to say this is that make helps us follow the DRY principle: Don't Repeat Yourself. Instead of copying the same filenames and command patterns in lots of places, we describe the workflow once and reuse that description.
Chaining Steps Together
The workflow in gecs-make has a second stage: turning counts into an HTML figure.
counts/dracula.tsv : books/dracula.txt scripts/count_words.py
	python scripts/count_words.py books/dracula.txt counts/dracula.tsv
figures/dracula.html : counts/dracula.tsv scripts/plot_counts.py
	python scripts/plot_counts.py counts/dracula.tsv figures/dracula.html
Now the dependencies form a chain:
books/dracula.txt -> counts/dracula.tsv -> figures/dracula.html
If we ask for the final output,
make figures/dracula.html
make will build counts/dracula.tsv first if needed, then build the figure.
This is the core idea behind make: we describe how files depend on each other, and make figures out the correct order.
This dependency structure is often described as a Directed Acyclic Graph, or DAG:
- Directed means the dependencies point in one direction, from inputs to outputs.
- Acyclic means the graph cannot contain loops such as "A depends on B and B depends on A".
- Graph means we are really dealing with a network of connected files and build steps, not just one long shell script.
For this project, the DAG is simple:
books/dracula.txt -> counts/dracula.tsv -> figures/dracula.html
As the project grows to many books, the graph gets wider, but the idea stays the same. make works well because it can walk that DAG and rebuild only the parts that are affected by a change.
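The ordering problem that make solves can be sketched with Python's standard graphlib module (Python 3.9 or later). This is only an illustration of walking a DAG, not a description of make's internals; the dependencies dictionary and the build_order helper are names chosen for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a target; its value is the set of files it depends on.
# The mapping mirrors the two Dracula rules from this section.
dependencies = {
    "counts/dracula.tsv": {"books/dracula.txt", "scripts/count_words.py"},
    "figures/dracula.html": {"counts/dracula.tsv", "scripts/plot_counts.py"},
}


def build_order(graph):
    """Return the nodes so that every dependency comes before its target."""
    return list(TopologicalSorter(graph).static_order())


order = build_order(dependencies)
print(order)
```

The exact position of unrelated files in the printed list can vary, but counts/dracula.tsv always appears before figures/dracula.html. If the graph contained a loop, TopologicalSorter would raise graphlib.CycleError, which is the "acyclic" requirement in practice.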
Phony Targets
Sometimes we want a target that stands for an action rather than a real file. For example:
.PHONY : all
all : figures/dracula.html
Now make all is a convenient alias for building the Dracula plot.
Cleanup is another common phony target:
.PHONY : clean
clean :
	rm -f counts/*.tsv figures/*.html
Then we can run:
make clean
.PHONY?
If a file named clean or all ever appears in the project, make might think the target is already up to date and skip the recipe. Marking those names as .PHONY tells make they are commands, not output files.
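The effect of .PHONY can be modeled with a few lines of Python. This is a simplified sketch for a target that has no dependencies, such as clean; the helper recipe_runs is hypothetical, and make's real decision also takes prerequisites into account.

```python
import os


def recipe_runs(target, phony_targets):
    """Simplified decision for a target that has no dependencies.

    Hypothetical helper: make's real logic also considers prerequisites.
    """
    if target in phony_targets:
        return True  # phony targets are never considered up to date
    # Without .PHONY, an existing file with that name counts as up to date.
    return not os.path.exists(target)


# With "clean" declared phony, the recipe runs whether or not a file named clean exists.
print(recipe_runs("clean", {"all", "clean"}))
```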
Automatic Variables
The two rules above repeat the same filenames in the target line and the recipe. That is easy to write for one example, but it becomes tedious quickly.
make provides automatic variables to reduce duplication:
- $@ means the current target
- $^ means all dependencies for the current rule
- $< means the first dependency
We can rewrite the rules like this:
counts/dracula.tsv : books/dracula.txt scripts/count_words.py
	python scripts/count_words.py $< $@
figures/dracula.html : counts/dracula.tsv scripts/plot_counts.py
	python scripts/plot_counts.py $< $@
In both cases, the scripts take the first dependency as input and write to the target path. This makes the recipes shorter and keeps the filenames consistent with the rule declaration.
That is another DRY win: the target and dependency names already appear in the rule header, so automatic variables let us reuse them instead of typing them again in the recipe.
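As a rough mental model, automatic variables behave like text substitution performed just before the recipe runs. The Python sketch below imitates that for $@, $< and $^; the function expand_automatic is hypothetical and ignores details of real make, such as the removal of duplicate entries in $^.

```python
def expand_automatic(recipe, target, dependencies):
    """Replace $@, $< and $^ in a recipe string (illustrative only)."""
    return (
        recipe.replace("$@", target)
        .replace("$<", dependencies[0])
        .replace("$^", " ".join(dependencies))
    )


command = expand_automatic(
    "python scripts/count_words.py $< $@",
    target="counts/dracula.tsv",
    dependencies=["books/dracula.txt", "scripts/count_words.py"],
)
print(command)
# python scripts/count_words.py books/dracula.txt counts/dracula.tsv
```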
Code is a Dependency
Notice that the rules depend not only on data files, but also on scripts.
counts/dracula.tsv : books/dracula.txt scripts/count_words.py
That means if scripts/count_words.py changes, make knows it should regenerate counts/dracula.tsv. This is a very important idea in computational work: code changes can invalidate outputs just as much as data changes do.
Variables in Makefiles
As the workflow grows, we start repeating commands and script paths. Variables help make the Makefile easier to read and easier to update.
For example:
PYTHON=python
COUNT_SCRIPT=scripts/count_words.py
PLOT_SCRIPT=scripts/plot_counts.py
We can then use those variables in our rules:
PYTHON=python
COUNT_SCRIPT=scripts/count_words.py
PLOT_SCRIPT=scripts/plot_counts.py
counts/dracula.tsv : books/dracula.txt $(COUNT_SCRIPT)
	$(PYTHON) $(COUNT_SCRIPT) $< $@
figures/dracula.html : counts/dracula.tsv $(PLOT_SCRIPT)
	$(PYTHON) $(PLOT_SCRIPT) $< $@
This is especially helpful when several rules use the same script or command. If something changes later, we update the variable once instead of hunting through the whole file.
Variables are one of the simplest ways to make a Makefile more DRY. They let us keep shared information in one place instead of repeating it across many rules.
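Variable references of the form $(NAME) can also be pictured as text substitution. The Python sketch below is a simplified model under that assumption; expand_variables is a hypothetical helper that handles only plain $(NAME) references, not function calls or nested variables.

```python
import re

variables = {
    "PYTHON": "python",
    "COUNT_SCRIPT": "scripts/count_words.py",
}


def expand_variables(text, variables):
    """Replace each $(NAME) reference with its value (simplified model)."""
    # Unknown names expand to the empty string, as undefined make variables do.
    return re.sub(r"\$\((\w+)\)", lambda m: variables.get(m.group(1), ""), text)


print(expand_variables("$(PYTHON) $(COUNT_SCRIPT) $< $@", variables))
# python scripts/count_words.py $< $@
```

Changing the value of PYTHON in the dictionary changes every expanded command at once, which is the same benefit the Makefile variables provide.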
Pattern Rules
At this point, the workflow still only handles Dracula. But the project contains many books:
books/dracula.txt
books/frankenstein.txt
books/jane_eyre.txt
books/moby_dick.txt
books/sense_and_sensibility.txt
books/sherlock_holmes.txt
books/time_machine.txt
The filenames change, but the workflow shape stays the same. That is exactly what pattern rules are for.
PYTHON=python
COUNT_SCRIPT=scripts/count_words.py
PLOT_SCRIPT=scripts/plot_counts.py
counts/%.tsv : books/%.txt $(COUNT_SCRIPT)
	$(PYTHON) $(COUNT_SCRIPT) $< $@
figures/%.html : counts/%.tsv $(PLOT_SCRIPT)
	$(PYTHON) $(PLOT_SCRIPT) $< $@
The % symbol is a wildcard stem. With these two rules, make now knows how to build:
- counts/dracula.tsv from books/dracula.txt
- counts/frankenstein.tsv from books/frankenstein.txt
- figures/moby_dick.html from counts/moby_dick.tsv
- and so on
That means we no longer need one handwritten rule per book.
Pattern rules are one of the biggest DRY improvements in make. We replace many nearly identical rules with one general rule that captures the shared structure of the workflow.
We can still ask for a specific target:
make figures/jane_eyre.html
and make will fill in the pattern automatically.
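How the stem gets filled in can be sketched in Python. The helpers match_stem and resolve_rule below are hypothetical and cover only the simple case used in this lesson (one % per pattern); real make has additional rules for directories and for choosing between competing patterns.

```python
def match_stem(pattern, target):
    """Return the text matched by % in pattern, or None if target does not fit."""
    prefix, suffix = pattern.split("%", 1)
    if (
        target.startswith(prefix)
        and target.endswith(suffix)
        and len(target) > len(prefix) + len(suffix)  # the stem must not be empty
    ):
        return target[len(prefix):len(target) - len(suffix)]
    return None


def resolve_rule(target, target_pattern, dependency_patterns):
    """Fill the stem into each dependency pattern of a matching rule."""
    stem = match_stem(target_pattern, target)
    if stem is None:
        return None
    return [dep.replace("%", stem, 1) for dep in dependency_patterns]


print(resolve_rule(
    "figures/jane_eyre.html",
    "figures/%.html",
    ["counts/%.tsv", "scripts/plot_counts.py"],
))
# ['counts/jane_eyre.tsv', 'scripts/plot_counts.py']
```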
Functions
Pattern rules tell make how to build matching files. Functions help us build lists of files.
The most useful starting point here is wildcard, which finds filenames that match a pattern:
BOOK_FILES=$(wildcard books/*.txt)
In gecs-make, that expands to all the .txt files in books/.
Next, we can use patsubst to transform those input filenames into the output files we want:
BOOK_FILES=$(wildcard books/*.txt)
COUNT_FILES=$(patsubst books/%.txt,counts/%.tsv,$(BOOK_FILES))
FIGURE_FILES=$(patsubst books/%.txt,figures/%.html,$(BOOK_FILES))
Now we have three connected lists:
- BOOK_FILES for raw texts
- COUNT_FILES for generated count tables
- FIGURE_FILES for generated plots
This is what allows make to scale from one hard-coded book to the entire project.
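For readers more familiar with Python, the two functions can be approximated as follows. This is a rough sketch: glob.glob stands in for wildcard, and the patsubst helper below is a hypothetical reimplementation that handles only a single % per pattern.

```python
import glob


def patsubst(pattern, replacement, names):
    """Rewrite each name that matches pattern; leave other names unchanged."""
    prefix, suffix = pattern.split("%", 1)
    result = []
    for name in names:
        if (
            name.startswith(prefix)
            and name.endswith(suffix)
            and len(name) >= len(prefix) + len(suffix)
        ):
            stem = name[len(prefix):len(name) - len(suffix)]
            result.append(replacement.replace("%", stem, 1))
        else:
            result.append(name)
    return result


# glob.glob plays the role of $(wildcard ...); sorted() makes the order stable.
# Assumes the script is run from the root of the gecs-make repository.
book_files = sorted(glob.glob("books/*.txt"))
count_files = patsubst("books/%.txt", "counts/%.tsv", book_files)
figure_files = patsubst("books/%.txt", "figures/%.html", book_files)
print(figure_files)
```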
Building Everything
Once we have those lists, we can define a proper all target:
PYTHON=python
COUNT_SCRIPT=scripts/count_words.py
PLOT_SCRIPT=scripts/plot_counts.py
BOOK_FILES=$(wildcard books/*.txt)
COUNT_FILES=$(patsubst books/%.txt,counts/%.tsv,$(BOOK_FILES))
FIGURE_FILES=$(patsubst books/%.txt,figures/%.html,$(BOOK_FILES))
.PHONY : all
all : $(FIGURE_FILES)
counts/%.tsv : books/%.txt $(COUNT_SCRIPT)
	$(PYTHON) $(COUNT_SCRIPT) $< $@
figures/%.html : counts/%.tsv $(PLOT_SCRIPT)
	$(PYTHON) $(PLOT_SCRIPT) $< $@
.PHONY : clean
clean :
	rm -f counts/*.tsv figures/*.html
Running
make all
will now build the complete set of figures for all books in the repository.
make targets
Pixi tasks are still useful for environment management and for a few convenient commands. The advantage of make here is that it expresses file dependencies directly and scales naturally from one example book to a whole directory of inputs.
Self-documenting Makefile
As Makefiles grow, it becomes useful to advertise the main entry points. A simple way to do that is to add a help target.
.PHONY : help
help :
	@echo "Available targets:"
	@echo " make all Build all figures"
	@echo " make clean Remove generated counts and figures"
	@echo " make figures/dracula.html Build one specific figure"
For a slightly more polished version, we can annotate targets with comments and extract them automatically:
.DEFAULT_GOAL := help
## Show available targets
.PHONY : help
help:
	@echo "Available rules:"
	@grep -E '^## |^[a-zA-Z0-9_.\/%\-]+ *:' $(MAKEFILE_LIST) | \
	awk 'BEGIN {FS = " *:.*"} \
	/^## / {desc = substr($$0, 4); next} \
	/^\./ {next} \
	desc != "" {printf "%-22s %s\n", $$1, desc; desc = ""}'
## Build figures for all books
.PHONY : all
all : $(FIGURE_FILES)
## Remove generated counts and figures
.PHONY : clean
clean :
	rm -f counts/*.tsv figures/*.html
Here $(MAKEFILE_LIST) is a built-in make variable. It expands to the makefiles that were read during the current run, so we do not have to define it ourselves. In this version, grep and awk read the Makefile and pair each comment such as ## Build figures for all books with the first regular target that follows it, skipping special lines such as .PHONY so that each description is printed once, next to the target it documents.
Then:
make help
prints a short summary of the workflow.
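The pairing idea behind a self-documenting help target can also be sketched in Python. This is an illustration of the logic rather than a translation of the grep and awk pipeline; the function extract_help is hypothetical, and it assumes that special lines such as .PHONY should be skipped when looking for the target a comment belongs to.

```python
def extract_help(makefile_text):
    """Pair each '## description' line with the next regular target."""
    entries = []
    description = None
    for line in makefile_text.splitlines():
        if line.startswith("## "):
            description = line[3:].strip()
        elif description and ":" in line and not line.startswith((".", "\t", "#")):
            target = line.split(":", 1)[0].strip()
            entries.append((target, description))
            description = None  # use each description only once
    return entries


sample = """\
## Build figures for all books
.PHONY : all
all : $(FIGURE_FILES)
## Remove generated counts and figures
.PHONY : clean
clean :
\trm -f counts/*.tsv figures/*.html
"""

for target, description in extract_help(sample):
    print(f"{target:<22} {description}")
```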
Complete Example
Putting everything together, a Makefile for gecs-make could look like this:
PYTHON=python
COUNT_SCRIPT=scripts/count_words.py
PLOT_SCRIPT=scripts/plot_counts.py
BOOK_FILES=$(wildcard books/*.txt)
COUNT_FILES=$(patsubst books/%.txt,counts/%.tsv,$(BOOK_FILES))
FIGURE_FILES=$(patsubst books/%.txt,figures/%.html,$(BOOK_FILES))
.DEFAULT_GOAL := help
## Show available targets
.PHONY : help
help:
	@echo "Available rules:"
	@grep -E '^## |^[a-zA-Z0-9_.\/%\-]+ *:' $(MAKEFILE_LIST) | \
	awk 'BEGIN {FS = " *:.*"} \
	/^## / {desc = substr($$0, 4); next} \
	/^\./ {next} \
	desc != "" {printf "%-22s %s\n", $$1, desc; desc = ""}'
## Build figures for all books
.PHONY : all
all : $(FIGURE_FILES)
## Build count tables for all books
# counts is also the name of a directory, so mark the target as phony
.PHONY : counts
counts : $(COUNT_FILES)
counts/%.tsv : books/%.txt $(COUNT_SCRIPT)
	$(PYTHON) $(COUNT_SCRIPT) $< $@
figures/%.html : counts/%.tsv $(PLOT_SCRIPT)
	$(PYTHON) $(PLOT_SCRIPT) $< $@
## Remove generated counts and figures
.PHONY : clean
clean :
	rm -f counts/*.tsv figures/*.html
This Makefile captures the real project workflow:
- text files in books/ are inputs
- count tables in counts/ are intermediate outputs
- plots in figures/ are final outputs
- the Python scripts in scripts/ are part of the dependency graph
Most importantly, the Makefile is no longer hard coded to Dracula. Add another .txt file to books/, and the workflow can pick it up automatically.
Final Words
make is useful when your work produces files from other files. Instead of rerunning commands by hand, we describe the dependency structure once and let make handle the ordering and incremental rebuilds.
In this project, the key ideas are:
- A rule has a target, dependencies, and a recipe.
- make models the workflow as a DAG, a dependency graph with no cycles.
- .PHONY is for targets like all, clean, and help that are not real files.
- Automatic variables such as $@ and $< reduce duplication.
- Variables, automatic variables, and pattern rules help us keep the Makefile DRY.
- Pattern rules such as counts/%.tsv : books/%.txt generalize the workflow across many books.
- Functions such as wildcard and patsubst let us build dynamic file lists from the real project structure.
Pixi remains useful for managing the environment. make adds the missing piece: a reusable, dependency-aware way to automate the actual analysis pipeline.