WordPress to Jekyll Migration

Step-by-step guide for converting WordPress content into a Jekyll static site. Covers extracting content from a WordPress HTML clone or XML export, transforming it into Jekyll-compatible files with proper frontmatter, setting up collections, and deploying.

When to Apply

Reference this skill when:

Converting a WordPress site to Jekyll
Setting up a Jekyll project from WordPress content
Transforming WordPress HTML/XML into Jekyll markdown or HTML with frontmatter
Cleaning up WordPress markup artifacts for static site use
Configuring Jekyll collections to match WordPress content types

Quick Start — Starter Template

Clone the jekyllwind starter repo to get a pre-configured Jekyll + Tailwind CSS project:

git clone https://github.com/koolamusic/jekyllwind my-jekyll-site
cd my-jekyll-site
bundle install && pnpm install
bundle exec jekyll serve   # Dev server at localhost:4000

This gives you a working Jekyll + Tailwind foundation with PostCSS already configured. From here, add your migrated WordPress content into the _posts/, _layouts/, and pages/ directories.

Why use the starter? Setting up Jekyll with Tailwind CSS and PostCSS from scratch requires coordinating Ruby gems, Node packages, and build config. The starter handles all of this so you can focus on migrating content.

Prerequisites

System Dependencies

Tool	Version	Purpose
Ruby	3.2+	Jekyll runtime
Bundler	latest	Ruby dependency management
Node.js	18+	Tailwind CSS / asset compilation
pnpm	latest	Node package manager (used by jekyllwind starter)
Python 3	3.8+	Content extraction and cleanup scripts
BeautifulSoup4	latest	HTML parsing in Python scripts

Starting From Scratch (without the starter)

If you prefer to set up manually instead of cloning jekyllwind:

# Gemfile
source 'https://rubygems.org'

gem 'jekyll', '~> 4.4'
gem 'webrick'               # Dev server (no longer ships with Ruby 3.0+)
gem 'jekyll-postcss-v2'     # PostCSS/Tailwind integration (optional)
gem 'jekyll-feed'           # RSS/Atom feed generation
gem 'jekyll-sitemap'        # XML sitemap
gem 'jekyll-seo-tag'        # Meta tags and structured data
gem 'logger'                # Leaving Ruby's default gems; declare explicitly
gem 'csv'                   # No longer a default gem as of Ruby 3.4
gem 'base64'                # No longer a default gem as of Ruby 3.4

Bootstrap

bundle install && pnpm install  # Install all dependencies
bundle exec jekyll serve        # Dev server at localhost:4000
bundle exec jekyll build        # Production build to _site/

Phase 1 — Source Acquisition

You need WordPress content in one of two forms:

Option A: Site Mirror (HTML clone)

Mirror the live WordPress site to capture rendered HTML and media:

# HTTrack
httrack "https://your-site.com" -O ./mirror \
  --mirror --robots=0 --depth=10

# wget alternative
wget --mirror --convert-links --adjust-extension \
  --page-requisites --no-parent https://your-site.com

The mirror gives you the rendered output of every page, including page builder content, plugin output, and all media files.

Option B: WordPress XML Export (WXR)

Export from WP Admin → Tools → Export. This gives you structured content with metadata but no media files and no rendered page builder output.

Recommended: Use Both

The XML export provides metadata (dates, categories, tags). The mirror provides the rendered HTML and the media files. Cross-reference both, matching entries by slug, for the best result.

Phase 2 — Content Extraction

A Python script parses each source file, classifies it by URL pattern, extracts frontmatter, pulls body content, and writes Jekyll-compatible files.

URL-to-Collection Mapping

WordPress URL Pattern	Jekyll Output	Collection
`/YYYY/MM/DD/slug/`	`_posts/YYYY-MM-DD-slug.html`	Blog posts
`/category/slug/`	Skip (Jekyll core does not build these; regenerate with a plugin such as jekyll-archives if needed)	n/a
`/tag/slug/`	Skip (same as category archives)	n/a
`/author/slug/`	Skip or `pages/`	n/a
`/page-slug/`	`pages/slug.html`	Standalone pages
Custom post types	`_collection-name/slug.html`	Custom collections
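
The mapping above can be sketched as a small classifier. This is a minimal illustration, not the guide's actual extraction script: `classify_url` and its `custom_types` parameter are hypothetical names, and it assumes custom post types live at `/type-name/slug/`.

```python
import re
from urllib.parse import urlparse

# /YYYY/MM/DD/slug/ is the dated permalink pattern from the table
DATED_POST = re.compile(r"^/(\d{4})/(\d{2})/(\d{2})/([^/]+)/?$")
SKIP_PREFIXES = ("/category/", "/tag/", "/author/")


def classify_url(url, custom_types=()):
    """Return the Jekyll output path for a WordPress URL, or None to skip it."""
    path = urlparse(url).path
    match = DATED_POST.match(path)
    if match:
        year, month, day, slug = match.groups()
        return f"_posts/{year}-{month}-{day}-{slug}.html"
    if path.startswith(SKIP_PREFIXES):
        return None  # archive pages are not migrated
    parts = [p for p in path.split("/") if p]
    if len(parts) == 2 and parts[0] in custom_types:
        return f"_{parts[0]}/{parts[1]}.html"  # custom collection
    if len(parts) == 1:
        return f"pages/{parts[0]}.html"  # standalone page
    return None  # anything else needs a manual decision
```

Returning `None` for unrecognized paths keeps the script from guessing; those URLs can be logged and reviewed by hand.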

Frontmatter Extraction

Derive YAML frontmatter from WordPress HTML meta tags or WXR XML:

From HTML mirror (OpenGraph tags):

# Meta tags → frontmatter fields
'og:title'               → title
'og:description'         → description
'og:image'               → image
'og:url'                 → permalink (path only, without scheme and host)
'article:published_time' → date
'article:modified_time'  → last_modified_at
'article:tag'            → tags (multiple)
'article:section'        → categories
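
One way to apply this mapping without extra dependencies is the standard library's `html.parser`. The sketch below is an assumption about how such a step could look; `MetaCollector` and `extract_frontmatter` are illustrative names, and the same logic can be written with BeautifulSoup.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Single-valued meta properties and the frontmatter key each one feeds
OG_TO_FRONTMATTER = {
    "og:title": "title",
    "og:description": "description",
    "og:image": "image",
    "og:url": "permalink",
    "article:published_time": "date",
    "article:modified_time": "last_modified_at",
    "article:section": "categories",
}


class MetaCollector(HTMLParser):
    """Collects <meta property=... content=...> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.frontmatter = {"tags": []}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop, content = attr.get("property"), attr.get("content")
        if not prop or content is None:
            return
        if prop == "article:tag":  # may appear several times
            self.frontmatter["tags"].append(content)
        elif prop in OG_TO_FRONTMATTER:
            self.frontmatter[OG_TO_FRONTMATTER[prop]] = content


def extract_frontmatter(html):
    collector = MetaCollector()
    collector.feed(html)
    fm = collector.frontmatter
    if "permalink" in fm:
        # Keep only the path so the permalink works on any host
        fm["permalink"] = urlparse(fm["permalink"]).path
    return fm
```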

From WXR XML:

# XML elements → frontmatter fields
'<title>'                → title
'<wp:post_date>'         → date
'<category domain="category">' → categories
'<category domain="post_tag">' → tags
'<wp:status>'            → published (true/false)
'<content:encoded>'      → body content
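
The WXR mapping can be read with the standard library's `xml.etree.ElementTree`. This is a sketch under stated assumptions: `parse_wxr_items` is a hypothetical helper, and the `wp` namespace URI assumes export format 1.2 (check the `xmlns:wp` attribute at the top of your export, since older files use a different version number).

```python
import xml.etree.ElementTree as ET

# Namespace URIs as declared on the <rss> element of the export (assumed 1.2)
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}


def parse_wxr_items(xml_text):
    """Yield one frontmatter-style dict per <item> in a WXR export."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        cats = item.findall("category")
        status = item.findtext("wp:status", default="", namespaces=NS)
        yield {
            "title": item.findtext("title", default=""),
            "date": item.findtext("wp:post_date", default="", namespaces=NS),
            "categories": [c.text for c in cats if c.get("domain") == "category"],
            "tags": [c.text for c in cats if c.get("domain") == "post_tag"],
            "published": status == "publish",
            "body": item.findtext("content:encoded", default="", namespaces=NS),
        }
```

Items in a WXR file also include attachments and navigation entries, so a real script would likely filter on `<wp:post_type>` before writing files.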

Content Body Extraction (from HTML mirror)

Isolate the article body from WordPress page chrome:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# Target the main content div — class varies by theme
content = (
    soup.find("div", class_="post-content") or
    soup.find("div", class_="entry-content") or
    soup.find("article", class_="post") or
    soup.find("main")
)

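Selecting only this container leaves out navigation, sidebars, related posts, comments, and the footer, provided the theme renders them outside it. Check the result when the lookup falls back to <main>, which often still contains some of them.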

Image Handling

Prefer data-src over src (WordPress lazy-loading stores real URL in data-src)
Strip query params: ?resize=800,600&ssl=1 → clean URL
Rewrite paths: wp-content/uploads/2024/01/photo.jpg → /assets/images/uploads/2024/01/photo.jpg
Download all images locally to assets/images/uploads/YYYY/MM/
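
The first three rules can be sketched with `urllib.parse`. The helper names `pick_image_url` and `localize_image_url` are hypothetical, and the sketch assumes every migrated image lives somewhere under `wp-content/uploads/`.

```python
from urllib.parse import urlsplit

UPLOADS_MARKER = "wp-content/uploads/"
LOCAL_PREFIX = "/assets/images/uploads/"


def pick_image_url(attrs):
    """Prefer data-src (the real URL under lazy loading) over src."""
    return attrs.get("data-src") or attrs.get("src")


def localize_image_url(url):
    """Drop the query string and rewrite an uploads URL to its local path."""
    path = urlsplit(url).path  # discards ?resize=...&ssl=1 and any fragment
    _, marker, rest = path.partition(UPLOADS_MARKER)
    if not marker:
        return url  # not a WordPress upload: leave external images alone
    return LOCAL_PREFIX + rest
```

The returned path also tells the download step where to save the file, since the `YYYY/MM/` part of the original upload path is preserved.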

Example Output

_posts/2024-01-15-my-blog-post.html

---
layout: post
title: "My Blog Post Title"
date: 2024-01-15
description: "A brief description from the meta tag"
image: /assets/images/uploads/2024/01/featured.jpg
categories: [technology, web-development]
tags: [jekyll, wordpress, migration]
---
<p>The extracted and cleaned article content goes here...</p>
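
A file like the one above could be produced by a small renderer. This is a minimal sketch, not a full YAML writer: `render_post` and `yaml_scalar` are illustrative names, and it relies on the fact that a JSON double-quoted string is also a valid YAML scalar, which keeps titles containing colons safe.

```python
import json
from datetime import date


def yaml_scalar(value):
    """Render one value as a YAML scalar."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, date):
        return value.isoformat()  # dates stay unquoted
    return json.dumps(str(value), ensure_ascii=False)  # quoted, colon-safe


def render_post(meta, body_html):
    """Build the full file text: frontmatter block followed by the body."""
    lines = ["---"]
    for key, value in meta.items():
        if isinstance(value, (list, tuple)):
            items = ", ".join(yaml_scalar(v) for v in value)
            lines.append(f"{key}: [{items}]")
        else:
            lines.append(f"{key}: {yaml_scalar(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n" + body_html.strip() + "\n"
```

Quoting every string is more conservative than the hand-written example, which leaves simple values bare; both forms parse to the same data.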

Phase 3 — Content Cleanup

WordPress themes inject deeply nested wrapper divs, custom classes, and inline styles. A cleanup script handles these systematically.

Cleanup Pipeline

Run a Python script with BeautifulSoup to clean WordPress artifacts:

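Strip Gutenberg comments: remove block delimiters such as <!-- wp:paragraph -->, <!-- /wp:paragraph -->, etc.
```
import re
# Matches opening, closing, and self-closing block comments
content = re.sub(r'<!--\s*/?wp:.*?-->', '', content, flags=re.DOTALL)
```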

Unwrap page builder containers — Peel nested wrappers from Visual Composer, Elementor, Divi:

# Classes to unwrap (element replaced by its children)
unwrap_classes = [
    'wpb_row', 'row-fluid', 'vc_inner', 'vc_column_container',
    'wp-block-image', 'wp-block-embed', 'wp-block-gallery',
    'elementor-widget-container', 'et_pb_module_inner'
]
for cls in unwrap_classes:
    for el in soup.find_all(class_=cls):
        el.unwrap()

Clean images — Strip WordPress-specific attributes, add lazy loading:

for img in soup.find_all('img'):
    # Use data-src if available (lazy loading plugins)
    if img.get('data-src'):
        img['src'] = img['data-src']
    # Strip WP attributes
    for attr in ['data-src', 'srcset', 'sizes', 'width', 'height',
                  'decoding', 'fetchpriority', 'title']:
        img.attrs.pop(attr, None)
    # Strip WP classes
    if img.get('class'):
        kept = [c for c in img['class']
                if not c.startswith(('wp-image-', 'aligncenter', 'size-'))]
        if kept:
            img['class'] = kept
        else:
            del img['class']   # avoid leaving an empty class=""
    img['loading'] = 'lazy'

Strip inline styles — WordPress content often has inline style attributes that override new theme CSS:
```
for el in soup.find_all(style=True):
    del el['style']
```

Remove empty elements — Iteratively delete empty <p>, <div>, <span>, headings:

for tag_name in ['p', 'div', 'span', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
    for el in soup.find_all(tag_name):
        if not el.get_text(strip=True) and not el.find(['img', 'iframe', 'video']):
            el.decompose()

Remove duplicate headings — If <h1> text matches the frontmatter title, remove it (the Jekyll layout renders it)
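
A dependency-free approximation of this step is sketched below. It uses a regular expression rather than BeautifulSoup, so treat it as an assumption-laden shortcut: `drop_duplicate_h1` is a hypothetical name, and only the first <h1> in the body is examined.

```python
import html
import re

H1_PATTERN = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
TAG_PATTERN = re.compile(r"<[^>]+>")


def normalize(text):
    """Strip tags, decode entities, collapse whitespace, ignore case."""
    plain = html.unescape(TAG_PATTERN.sub("", text))
    return " ".join(plain.split()).casefold()


def drop_duplicate_h1(body_html, title):
    """Remove the first <h1> if its text matches the frontmatter title."""
    def replace(match):
        if normalize(match.group(1)) == normalize(title):
            return ""
        return match.group(0)  # different heading: keep it

    return H1_PATTERN.sub(replace, body_html, count=1)
```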

Embed media URLs — Convert bare YouTube/Vimeo URLs to responsive iframes:

# Convert: https://www.youtube.com/watch?v=XXXX
# To: <div class="aspect-video"><iframe src="https://www.youtube.com/embed/XXXX" ...></iframe></div>
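
The conversion described in the comments above might look like the following for YouTube. This is a sketch: `embed_if_video` is a hypothetical helper, it only fires when the URL is the entire text of a paragraph, and the iframe attributes beyond `src` are assumptions. Vimeo would need its own pattern.

```python
import re

# A bare watch or short link, alone on its line, with an 11-character video ID
YOUTUBE_URL = re.compile(
    r"^\s*https?://(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)([\w-]{11})\S*\s*$"
)


def embed_if_video(paragraph_text):
    """Return responsive iframe markup for a bare YouTube URL, else None."""
    match = YOUTUBE_URL.match(paragraph_text)
    if not match:
        return None  # ordinary text, or a URL mentioned inside a sentence
    video_id = match.group(1)
    return (
        '<div class="aspect-video">'
        f'<iframe src="https://www.youtube.com/embed/{video_id}" '
        'loading="lazy" allowfullscreen></iframe></div>'
    )
```

Requiring the URL to stand alone mirrors how WordPress auto-embeds work and avoids rewriting links that appear inside prose.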

Running Cleanup

python3 scripts/clean.py              # Dry-run: prints changes
python3 scripts/clean.py --apply      # Apply changes in-place
python3 scripts/clean.py --backup     # Backup originals first
python3 scripts/clean.py --file x.html # Process single file
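
The guide does not show scripts/clean.py itself, so the following is only one possible skeleton for those flags. `build_parser` and `process` are hypothetical names, `clean` stands for whatever cleanup function you assemble from the pipeline steps, and the sketch assumes `--backup` only has an effect together with `--apply`.

```python
import argparse
import shutil
from pathlib import Path


def build_parser():
    parser = argparse.ArgumentParser(description="Clean WordPress markup in extracted files")
    parser.add_argument("--apply", action="store_true",
                        help="write changes in place (default is a dry run)")
    parser.add_argument("--backup", action="store_true",
                        help="copy each original to <name>.bak before writing")
    parser.add_argument("--file", type=Path,
                        help="process a single file instead of every file")
    return parser


def process(path, clean, apply=False, backup=False):
    """Run clean() on one file; return True if the content would change."""
    original = path.read_text(encoding="utf-8")
    cleaned = clean(original)
    changed = cleaned != original
    if changed and apply:
        if backup:
            shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_text(cleaned, encoding="utf-8")
    return changed
```

Defaulting to a dry run means a mistake in a cleanup rule shows up as a report rather than as damaged files.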

Phase 4 — Jekyll Architecture

Directory Structure

If you cloned the jekyllwind starter, you already have the base structure. Extend it for your migrated content:

my-jekyll-site/                   # git clone https://github.com/koolamusic/jekyllwind
├── _config.yml                   # Site config, collections, defaults, plugins
├── _posts/                       # Blog posts (YYYY-MM-DD-slug.html or .md)
├── _drafts/                      # Unpublished posts (add this)
├── _layouts/                     # Page templates (from starter)
│   ├── default.html              # Base layout: <html>, <head>, nav, footer
│   ├── post.html                 # Blog post template
│   └── page.html                 # Generic page template
├── _includes/                    # Reusable components (add as needed)
│   ├── header.html
│   ├── footer.html
│   └── post-card.html
├── pages/                        # Standalone pages (about, contact, etc.)
├── assets/
│   ├── css/main.css              # Tailwind directives (from starter)
│   └── images/uploads/           # Migrated WordPress media (YYYY/MM/)
├── Gemfile                       # Ruby dependencies (from starter)
├── package.json                  # Node dependencies (from starter)
├── tailwind.config.js            # Tailwind theme config (from starter)
├── postcss.config.js             # PostCSS pipeline (from starter)
└── netlify.toml                  # Deployment config (add this)

Custom Collections

Map WordPress custom post types to Jekyll collections in _config.yml:

# Preserve WordPress URL structure
permalink: /:year/:month/:day/:title/

collections:
  # Example: WordPress 'portfolio' post type → Jekyll collection
  portfolio:
    output: true
    permalink: /portfolio/:title/
  # Example: WordPress 'talks' post type
  talks:
    output: true
    permalink: /talks/:title/

# Set default layouts per collection
defaults:
  - scope: { path: "", type: "posts" }
    values: { layout: "post" }
  - scope: { path: "", type: "portfolio" }
    values: { layout: "portfolio" }
  - scope: { path: "", type: "talks" }
    values: { layout: "portfolio" }
  - scope: { path: "pages" }
    values: { layout: "page" }

Key Config Settings

# _config.yml
title: "Your Site Name"
url: "https://your-site.com"
description: "Site description for SEO"

# Plugins
plugins:
  - jekyll-feed
  - jekyll-sitemap
  - jekyll-seo-tag
  - jekyll-postcss-v2    # Only if using Tailwind

# Feed configuration (generates RSS)
feed:
  collections:
    - posts
    - portfolio

Phase 5 — Content Format Conversion

HTML to Markdown (Optional)

You can optionally convert extracted HTML content to Markdown for easier editing. Not all content converts cleanly — complex layouts, tables, and embedded media may be better left as HTML.

Good candidates for Markdown conversion:

Text-heavy blog posts
Simple pages with headings, paragraphs, lists, links, images

Keep as HTML:

Posts with complex layouts or embedded widgets
Content with custom CSS classes you want to preserve
Pages with embedded iframes, forms, or interactive elements

Conversion approach:

import markdownify

# Convert HTML to Markdown, preserving images and links
markdown_content = markdownify.markdownify(
    html_content,
    heading_style="atx",        # Use # style headings
    bullets="-",                 # Use - for unordered lists
    strip=['script', 'style']   # Remove script and style tags
)

Frontmatter Field Mapping

WordPress Field	Jekyll Frontmatter	Notes
Post title	`title`	Wrap in quotes if contains colons
Published date	`date`	Format: `YYYY-MM-DD` or `YYYY-MM-DD HH:MM:SS +0000`
Slug	`permalink`	Only if overriding the default pattern
Categories	`categories`	Array: `[cat1, cat2]`
Tags	`tags`	Array: `[tag1, tag2]`
Featured image	`image`	Path to local file in `assets/images/`
Meta description	`description`	From Yoast/RankMath or og:description
Author	`author`	String or reference to `_data/authors.yml`
Post status: draft	Move to `_drafts/`	Drafts don't need a date prefix in filename
Custom fields	Custom frontmatter keys	Map ACF fields to meaningful frontmatter names
Password protected	`protected: true`	Implement client-side gating in layout (this hides the content but does not secure it)
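
The draft rule in the table can be sketched as a path helper. `output_path` is a hypothetical name, and it assumes `post_date` arrives as a WXR-style `YYYY-MM-DD HH:MM:SS` string and `status` as the raw `<wp:status>` value.

```python
def output_path(slug, post_date, status):
    """Choose the output file for a post based on its WordPress status."""
    if status == "draft":
        return f"_drafts/{slug}.html"  # drafts carry no date prefix
    return f"_posts/{post_date[:10]}-{slug}.html"  # keep only YYYY-MM-DD
```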

Phase 6 — Build & Deployment

Local Development

bundle exec jekyll serve --livereload    # Dev server with auto-reload
bundle exec jekyll serve --drafts        # Include drafts
bundle exec jekyll build                 # Production build to _site/

Netlify Deployment

# netlify.toml
[build]
  command = "bundle exec jekyll build"
  publish = "_site"

[build.environment]
  JEKYLL_ENV = "production"
  RUBY_VERSION = "3.2.0"
  NODE_VERSION = "18"

GitHub Pages Deployment

# .github/workflows/jekyll.yml
name: Deploy Jekyll
on:
  push:
    branches: [main]
jobs:
  build-deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: write   # allow GITHUB_TOKEN to push the built site
    steps:
      - uses: actions/checkout@v4
      - uses: ruby/setup-ruby@v1
        with:
          ruby-version: '3.2'
          bundler-cache: true
      - run: bundle exec jekyll build
      - uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./_site

URL Redirects

Preserve WordPress URLs that changed during migration:

# netlify.toml redirects
[[redirects]]
  from = "/feed/"
  to = "/feed.xml"
  status = 301

[[redirects]]
  from = "/wp-content/uploads/*"
  to = "/assets/images/uploads/:splat"
  status = 301
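
When many URLs changed, blocks like these can be generated from a mapping of old to new paths. A minimal sketch, assuming you have collected that mapping during extraction; `netlify_redirects` is a hypothetical helper and does no escaping of quotes in paths.

```python
def netlify_redirects(mapping, status=301):
    """Render [[redirects]] blocks for every path that actually changed."""
    blocks = []
    for old, new in mapping.items():
        if old == new:
            continue  # unchanged URLs need no redirect
        blocks.append(
            "[[redirects]]\n"
            f'  from = "{old}"\n'
            f'  to = "{new}"\n'
            f"  status = {status}\n"
        )
    return "\n".join(blocks)
```

The returned text can be appended to netlify.toml after reviewing it.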

For Jekyll-native redirects, add the jekyll-redirect-from gem to the Gemfile and to the plugins list in _config.yml, then list the old URLs in frontmatter:

# In a post's frontmatter
redirect_from:
  - /old-url/
  - /another-old-url/

Lessons Learned & Pitfalls

Content Extraction

WordPress lazy loading: data-src vs src — Lazy-loading plugins store the real image URL in data-src and a placeholder in src. Always check data-src first.
Image query parameters must be stripped — WordPress appends ?resize=800,600&ssl=1. These break on static hosting.
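Gutenberg comments have varied syntax: some are self-closing (<!-- wp:separator /-->), some wrap content in an opening and closing pair. Use a regex that covers all three forms, such as <!--\s*/?wp:.*?-->.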
Visual Composer nesting is extreme — A single image can be wrapped in 5+ layers of divs. Your cleanup script needs multiple unwrapping passes.

Build & Tooling

cssnano + csso/css-tree incompatibility — If using PostCSS, do NOT add cssnano. It pulls in csso which breaks with certain css-tree versions.
jekyll-postcss-v2 requires empty frontmatter — Your CSS file must start with ---\n--- for Jekyll to process it through PostCSS.
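Tailwind arbitrary calc() values cannot contain spaces: w-[calc(100%-2rem)] works; w-[calc(100% - 2rem)] does not, because the space ends the class name. Where spaces are needed, write underscores instead: w-[calc(100%_-_2rem)].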

Design & CSS

Inline style attributes override Tailwind dark mode — WordPress content with style="color: #333" overrides dark:text-white. Strip all inline styles during cleanup.
Multiple collections can share a layout — Use _config.yml defaults to assign the same layout to similar collection types, avoiding duplication.
Preserve WordPress permalink structure — Set permalink: /:year/:month/:day/:title/ to maintain existing URLs and prevent 404s from external links and search engines.
