Lesson 2.2: Layering and Caching

Welcome to Lesson 2.2! You've already built a few images, but now it's time to understand what's happening behind the scenes. Docker images are composed of layers, and Docker uses a build cache to speed up subsequent builds. In this lesson, you'll learn how layers work, how caching can dramatically improve build times, and best practices to write efficient Dockerfiles.


Learning Objectives

TIP

By the end of this lesson, you will be able to:

  • Explain how Docker images are built as a stack of layers.
  • Describe how layer caching works during docker build.
  • Identify which instructions invalidate the cache and why.
  • Reorder Dockerfile instructions to maximize cache reuse.
  • Use --no-cache and other build options to control caching.
  • Apply best practices to create smaller, faster-building images.

1. How Layers Work

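When you build an image from a Dockerfile, Docker processes the instructions in order and records each one in the image's history. Instructions that change the filesystem (such as RUN, COPY, and ADD) add a new layer, while metadata-only instructions (such as CMD or LABEL) add a zero-size entry. Layers are stacked on top of each other, and each layer stores only the changes relative to the layer below it.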

1.1. Layered Filesystem

Docker uses a storage driver (like overlay2) to combine these layers into a single unified filesystem. When you run a container, Docker adds a thin writable container layer on top of the image layers.

INFO

Key characteristics:

  • Layers are read-only (except the container layer).
  • Each layer is identified by a unique hash (SHA256).
  • Layers are cached and reused across images if they are identical.
  • If a layer hasn't changed, Docker can reuse it from the cache, skipping the rebuild.

1.2. Viewing Layers

You can see the layers of an image with:

bash
docker history <image>

For example, docker history nginx:latest shows each layer with its creation command and size. Rows showing <missing> in the IMAGE column are not errors: those layers are part of the image, but they have no standalone image ID on your machine, typically because they were built elsewhere and pulled as part of the final image.

Visual: Image Layer Stack

+---------------------------+
|     Writable Layer        |  <- Container (ephemeral)
+---------------------------+
|         CMD / LABEL       |  <- Layer N
+---------------------------+
|     COPY . .              |  <- Layer 3
+---------------------------+
|   RUN pip install ...     |  <- Layer 2
+---------------------------+
|   COPY requirements.txt   |  <- Layer 1
+---------------------------+
|   WORKDIR /app            |  <- Layer 0
+---------------------------+
|   FROM python:3.11-slim   |  <- Base Image
+---------------------------+

2. Layer Caching During Build

When you run docker build, Docker executes each instruction in order. For each instruction, Docker checks if it can reuse a cached layer from a previous build.

2.1. Cache Matching Rules

Docker looks for an existing layer that matches the instruction and the build context. The matching is based on:

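  • The instruction and its exact text (e.g., RUN apt-get update). For RUN, only the command string is compared; Docker does not check what the command would produce.
  • For COPY and ADD, the checksum of the files being copied.
  • The parent: the base image and every layer before this instruction must also match.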

If a match is found, Docker reuses the cached layer and moves on to the next instruction. If not, Docker executes the instruction, and from that point on the cache is invalidated: every subsequent instruction is executed as well.
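
One way to picture these rules is as a chain of hashes, where each step's cache key depends on its parent's key, the instruction text, and (for COPY and ADD) the file contents. The shell sketch below is a toy model only: the layer_key helper, the file contents, and the hashing scheme are made up for illustration and are not Docker's real internal format. It assumes sha256sum from GNU coreutils is available.

```shell
#!/bin/sh
# Toy model of build-cache keys. NOT Docker's real algorithm: layer_key and
# the key format are hypothetical, for illustration only.

# layer_key PARENT_KEY INSTRUCTION [FILE...]
# Hashes the parent key, the instruction text, and the contents of any files.
layer_key() {
    parent="$1"; instruction="$2"; shift 2
    {
        printf '%s\n%s\n' "$parent" "$instruction"
        for f in "$@"; do cat "$f"; done
    } | sha256sum | cut -d' ' -f1
}

workdir=$(mktemp -d)
printf 'flask\n' > "$workdir/requirements.txt"

# First "build": compute a key for each step, chained to the previous one.
k1=$(layer_key "base" "COPY requirements.txt ." "$workdir/requirements.txt")
k2=$(layer_key "$k1" "RUN pip install -r requirements.txt")

# Second "build" with nothing changed: same key, so the step hits the cache.
k1_again=$(layer_key "base" "COPY requirements.txt ." "$workdir/requirements.txt")
[ "$k1" = "$k1_again" ] && echo "COPY: cache hit"

# Third "build" after editing the file: the COPY key changes, and so does the
# key of the RUN step after it, even though the RUN text is identical.
printf 'flask\nrequests\n' > "$workdir/requirements.txt"
k1_new=$(layer_key "base" "COPY requirements.txt ." "$workdir/requirements.txt")
k2_new=$(layer_key "$k1_new" "RUN pip install -r requirements.txt")
[ "$k1" != "$k1_new" ] && echo "COPY: cache miss"
[ "$k2" != "$k2_new" ] && echo "RUN: cache miss (parent changed)"

rm -rf "$workdir"
```

Because every key includes the parent's key, a single change near the top of the chain changes every key below it, which is why a cache miss cascades to all later steps.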

2.2. Cache Invalidation Triggers

WARNING

Cache is invalidated when:

  • The instruction changes (e.g., you modify a RUN command).
  • For COPY/ADD, if any file content changes (checksum differs).
  • The base image changes (e.g., you update the tag from ubuntu:22.04 to ubuntu:23.04).
  • A previous layer was rebuilt, forcing all later layers to rebuild.

2.3. Example: Cache in Action

Consider this Dockerfile:

dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "app.py"]

First build:

  • All layers are built fresh.

Second build (no changes):

  • Docker checks each instruction:
    • FROM – cached.
    • WORKDIR – cached.
    • COPY requirements.txt . – no changes → cached.
    • RUN pip install ... – cached.
    • COPY . . – no changes → cached.
  • Entire build uses cache → instant.

Third build (you modify app.py but not requirements.txt):

  • FROM, WORKDIR, COPY requirements.txt are cached.
  • RUN pip install is cached (because its input – requirements.txt – hasn't changed).
  • COPY . . sees that files (including app.py) have changed → cache invalidated, this layer rebuilds.
  • CMD only sets metadata, so re-applying it is nearly instant. It is still re-evaluated because it follows a rebuilt layer: once one step misses the cache, every later step does too.

Result: Only the final COPY . . and metadata steps are rebuilt – much faster than a full rebuild.

Fourth build (you modify requirements.txt):

  • FROM, WORKDIR cached.
  • COPY requirements.txt . – file changed → cache invalidated, this layer rebuilds.
  • RUN pip install – because previous layer changed, cache is invalidated, it rebuilds.
  • COPY . . – because the previous layer changed, it rebuilds (app.py is unchanged, but the cache was already invalidated by an earlier step).

TIP

This shows the importance of ordering instructions: put things that change less often earlier in the Dockerfile.


3. Best Practices for Leveraging Cache

3.1. Order Instructions from Least to Most Frequently Changing

Typical order:

  1. Base image (FROM) – rarely changes.
  2. Metadata (LABEL, WORKDIR, ENV) – may change occasionally.
  3. Dependency definitions (COPY requirements.txt, package.json) – change moderately.
  4. Dependency installation (RUN pip install, npm install) – based on above.
  5. Source code (COPY . .) – changes most frequently.

This maximizes cache hits for expensive steps like dependency installation.

3.2. Combine RUN Commands to Reduce Layers

Each RUN creates a layer. More layers aren't necessarily bad, but combining related commands (e.g., apt-get update && apt-get install -y ...) avoids two problems. First, if apt-get update sits in its own RUN, its layer stays cached, so a later change to the install line can run against a stale package index. Second, files deleted in a later RUN still take up space in the earlier layer, so cleanup only reduces image size when it happens in the same RUN that created the files.

Bad:

dockerfile
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get clean

Good:

dockerfile
RUN apt-get update && \
    apt-get install -y curl && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

3.3. Use Specific Base Image Tags

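Using :latest undermines both caching and reproducibility: the tag moves whenever a new version is published, so the same Dockerfile can produce different images, and every layer on top of the base is rebuilt once the new base is pulled. Pin to a specific version (e.g., python:3.11-slim) for more predictable builds.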

3.4. Leverage BuildKit for Better Caching (Optional)

Docker BuildKit (the default builder since Docker Engine 23.0) offers advanced caching features such as cache mounts that persist directories between builds, but that's beyond this lesson.

3.5. Use --no-cache When You Need a Fresh Build

Sometimes you want to bypass the cache entirely, e.g., to force a re-download of packages or to ensure all steps run:

bash
docker build --no-cache -t myimage .

3.6. Use --cache-from for CI/CD

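CI runners often start with an empty local cache, so every build would otherwise start from scratch. With --cache-from, you can name a previously built image (typically one pushed to your registry) as a cache source, and Docker reuses its layers where they match. This is advanced but worth knowing.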
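
As a rough sketch of how this might look in a pipeline script, the function below pulls the previous image, builds with it as a cache source, and pushes the result. The ci_build name and the registry.example.com/myapp:latest image reference are hypothetical; the BUILDKIT_INLINE_CACHE build argument assumes the build runs under BuildKit.

```shell
#!/bin/sh
# Sketch of a CI build step that seeds the cache from a previously pushed image.
# ci_build and the image name used below are hypothetical placeholders.
ci_build() {
    image="$1"

    # Fetch the previous image if it exists; ignore failure on the very first run.
    docker pull "$image" || true

    # Reuse layers from that image where possible. BUILDKIT_INLINE_CACHE=1 asks
    # BuildKit to embed cache metadata so the pushed image can act as a cache
    # source for the next run.
    docker build \
        --cache-from "$image" \
        --build-arg BUILDKIT_INLINE_CACHE=1 \
        -t "$image" . || return 1

    # Publish the result so the next pipeline run can use it as its cache.
    docker push "$image"
}

# Example invocation (uncomment in a real pipeline):
# ci_build registry.example.com/myapp:latest
```

The pull is allowed to fail so that the first run, when no image has been pushed yet, still proceeds with a plain uncached build.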


4. Inspecting and Debugging Caching

4.1. --progress=plain to See Cache Status

When building, you can see which steps are using cache by setting build output to plain:

bash
docker build --progress=plain -t myimage .

Lines with CACHED indicate cache hits.
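
If you only want a quick number rather than the full log, a small filter can count those lines. This is a minimal sketch: count_cached_steps is a hypothetical helper name, and it simply counts lines containing the word CACHED in whatever text it receives.

```shell
#!/bin/sh
# count_cached_steps: reads plain-progress build output on stdin and prints
# how many lines report a cache hit. The function name is hypothetical.
count_cached_steps() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps the
    # function usable in scripts that stop on errors.
    grep -c 'CACHED' || true
}

# Example (BuildKit writes progress to stderr, hence 2>&1):
# docker build --progress=plain -t myimage . 2>&1 | count_cached_steps
```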

4.2. docker history to Inspect Layers

bash
docker history myimage

Shows the size and creation time of each layer, helping you see if layers are unexpectedly large.
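
When the default table is too wide to scan, the --format option can narrow the output to the columns you care about. The sketch below wraps that in a helper; layer_report is a hypothetical name, while --no-trunc and the {{.Size}} and {{.CreatedBy}} placeholders are standard docker history options.

```shell
#!/bin/sh
# layer_report IMAGE: print one line per history entry as "<size>: <command>".
# layer_report is a hypothetical helper name chosen for this sketch.
layer_report() {
    docker history --no-trunc --format '{{.Size}}: {{.CreatedBy}}' "$1"
}

# Example:
# layer_report myimage
```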

4.3. dive Tool for Advanced Layer Inspection

The open-source tool dive provides an interactive way to explore layers, see what each adds, and identify wasted space.

Once dive is installed (see the project's README for installation options), run:

bash
dive myimage

Hands-On Tasks

Task 1: Observe Caching in Action

  1. Create a directory cache-demo.
  2. Create a Dockerfile:
    dockerfile
    FROM alpine:latest
    RUN echo "Step 1: Installing packages" && sleep 2
    RUN echo "Step 2: Configuring" && sleep 2
    COPY . /app
    RUN echo "Step 3: Building" && sleep 2
    CMD echo "Done"
  3. Build with docker build -t cache-demo . – note the time.
  4. Build again (no changes) – observe that all steps are cached and it's instant.
  5. Modify a file in the context (e.g., create a new file) and rebuild. Which steps are cached? Which rebuild? (The COPY layer should invalidate and everything after it rebuilds.)
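
To make the timing comparison in this task less dependent on a stopwatch, you could wrap the build in a small helper like the one below. It is only a sketch: timed_build is a hypothetical name, it measures whole seconds using date, and it hides the build output, so drop the redirection if you want to see which steps were cached.

```shell
#!/bin/sh
# timed_build TAG: build the current directory and report the elapsed time
# in whole seconds. timed_build is a hypothetical helper for this exercise.
timed_build() {
    start=$(date +%s)
    docker build -t "$1" . >/dev/null 2>&1 || return 1
    end=$(date +%s)
    echo "$1 built in $((end - start))s"
}

# Example, run from inside cache-demo:
# timed_build cache-demo   # first build: slow
# timed_build cache-demo   # second build: should be much faster
```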

Task 2: Optimize a Dockerfile

Start with a suboptimal Dockerfile:

dockerfile
FROM node:18
COPY . /app
WORKDIR /app
RUN npm install
CMD ["npm", "start"]
  1. Build it (first build).
  2. Modify a source file (e.g., index.js) and rebuild. Notice that npm install reruns even though dependencies didn't change – this is because COPY . /app copies everything, including source changes, causing cache invalidation before RUN npm install.
  3. Optimize by reordering:
    dockerfile
    FROM node:18
    WORKDIR /app
    COPY package*.json ./
    RUN npm install
    COPY . .
    CMD ["npm", "start"]
  4. Rebuild (fresh). Then modify a source file again and rebuild. Observe that npm install is now cached.

Task 3: Experiment with Cache Invalidation Triggers

  1. Create a Dockerfile that uses an environment variable:
    dockerfile
    FROM alpine
    ENV GREETING="Hello"
    RUN echo $GREETING > /message
    CMD cat /message
  2. Build and run – prints "Hello".
  3. Change ENV GREETING="Hi" and rebuild. Is the RUN layer cached? (No. The ENV instruction itself changed, so its cache entry no longer matches, and every instruction after it, including RUN, is rebuilt.)
  4. Notice that the RUN line is textually identical in both builds, yet it still reruns. Cache lookup depends on everything that came before a step, not just on the step's own text.

Task 4: Use .dockerignore to Improve Cache Efficiency

  1. Create a project with a large, irrelevant directory (e.g., node_modules or data).
  2. Write a Dockerfile that copies the entire context.
  3. Without .dockerignore, modifying any file in the context, including one in the large directory, invalidates the COPY layer. With .dockerignore, ignored files are excluded from the build context, so changes to them do not affect the checksum of the copy. Test this:
    • Create .dockerignore with data/.
    • Build.
    • Modify a file inside data/ and rebuild – the COPY layer should be cached.
    • Modify a non-ignored file – the COPY layer rebuilds.
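
If you'd rather not set the project up by hand, the following script creates a throwaway version of it. Everything here is a placeholder chosen for this sketch: the temporary directory, the file names app.txt and data/big.bin, the alpine:3.19 base image, and the ignore-demo tag. It only creates files; the docker build commands are left as comments for you to run.

```shell
#!/bin/sh
# Sets up a throwaway project for Task 4. All names below are placeholders
# chosen for this sketch.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/data"

printf 'hello\n' > "$demo_dir/app.txt"
# A stand-in for a large, irrelevant file (1 MiB of zeros).
dd if=/dev/zero of="$demo_dir/data/big.bin" bs=1024 count=1024 2>/dev/null

cat > "$demo_dir/Dockerfile" <<'EOF'
FROM alpine:3.19
COPY . /app
CMD ["ls", "/app"]
EOF

# Exclude data/ from the build context.
printf 'data/\n' > "$demo_dir/.dockerignore"

echo "Project created in $demo_dir"
# Next steps, run manually:
#   cd "$demo_dir" && docker build -t ignore-demo .
#   echo change >> data/big.bin && docker build -t ignore-demo .   # COPY stays cached
#   echo change >> app.txt      && docker build -t ignore-demo .   # COPY rebuilds
```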

Task 5: Compare Layer Sizes with docker history

  1. Build the optimized and unoptimized versions of a Dockerfile (e.g., one that installs packages and copies source).
  2. Run docker history <image> on both and compare the sizes of layers. Note how combined RUN commands produce smaller total size because intermediate files are removed in the same layer.

Summary

Key Takeaways

  • Docker images consist of layers, each representing a set of changes.
  • The build cache reuses unchanged layers, speeding up subsequent builds.
  • Cache is invalidated when the instruction or its input (files, base image) changes.
  • Order instructions from least to most frequently changing to maximize cache hits.
  • Combine related commands in a single RUN to reduce layers and clean up.
  • Use .dockerignore to prevent irrelevant file changes from invalidating cache.
  • Tools like docker history and dive help analyze layer efficiency.

Check Your Understanding

  1. What is a Docker image layer?
  2. How does Docker decide whether to use a cached layer for a COPY instruction?
  3. If you change a file that is not copied into the image (i.e., it's excluded by .dockerignore), will that invalidate the cache for the COPY layer? Why or why not?
  4. Why is it beneficial to copy dependency files (requirements.txt, package.json) before copying the rest of the source code?
  5. What command can you use to see the layers of an existing image?
  6. When would you use docker build --no-cache?

Answers

  1. A Docker image layer is a read-only snapshot of filesystem changes created by a single Dockerfile instruction. Layers are stacked to form the complete filesystem of an image.
  2. Docker computes a checksum of the files being copied. If the checksum matches the cached layer's checksum, the cached layer is reused.
  3. No. Since .dockerignore excludes the file from the build context, Docker never sees it, so the checksum of the COPY operation remains unchanged.
  4. Because dependency files change less frequently than source code. This way, when you only modify source files, the dependency installation layer stays cached and doesn't need to rerun.
  5. docker history <image> shows all layers with their sizes and creation commands.
  6. When you need a completely fresh build, such as forcing re-download of packages, bypassing stale cache, or ensuring all steps run in a CI/CD environment.

Next Up

In the next lesson, we'll cover environment variables and build arguments, giving you more control over your images. See you there!