wp-to-jekyll
WordPress to Jekyll Migration
Step-by-step guide for converting WordPress content into a Jekyll static site. Covers extracting content from a WordPress HTML clone or XML export, transforming it into Jekyll-compatible files with proper frontmatter, setting up collections, and deploying.
When to Apply
Reference this skill when:
- Converting a WordPress site to Jekyll
- Setting up a Jekyll project from WordPress content
- Transforming WordPress HTML/XML into Jekyll markdown or HTML with frontmatter
- Cleaning up WordPress markup artifacts for static site use
- Configuring Jekyll collections to match WordPress content types
Quick Start — Starter Template
Clone the jekyllwind starter repo to get a pre-configured Jekyll + Tailwind CSS project:
git clone https://github.com/koolamusic/jekyllwind my-jekyll-site
cd my-jekyll-site
bundle install && pnpm install
bundle exec jekyll serve # Dev server at localhost:4000
This gives you a working Jekyll + Tailwind foundation with PostCSS already configured. From here, add your migrated WordPress content into the _posts/, _layouts/, and pages/ directories.
Why use the starter? Setting up Jekyll with Tailwind CSS and PostCSS from scratch requires coordinating Ruby gems, Node packages, and build config. The starter handles all of this so you can focus on migrating content.
Prerequisites
System Dependencies
| Tool | Version | Purpose |
|---|---|---|
| Ruby | 3.2+ | Jekyll runtime |
| Bundler | latest | Ruby dependency management |
| Node.js | 18+ | Tailwind CSS / asset compilation |
| pnpm | latest | Node package manager (used by jekyllwind starter) |
| Python 3 | 3.8+ | Content extraction and cleanup scripts |
| BeautifulSoup4 | latest | HTML parsing in Python scripts |
Starting From Scratch (without the starter)
If you prefer to set up manually instead of cloning jekyllwind:
# Gemfile
source 'https://rubygems.org'
gem 'jekyll', '~> 4.4'
gem 'webrick' # Dev server (Ruby 3.x dropped it from stdlib)
gem 'jekyll-postcss-v2' # PostCSS/Tailwind integration (optional)
gem 'jekyll-feed' # Atom feed generation
gem 'jekyll-sitemap' # XML sitemap
gem 'jekyll-seo-tag' # Meta tags and structured data
gem 'logger' # No longer a default gem as of Ruby 3.5
gem 'csv' # No longer a default gem as of Ruby 3.4
gem 'base64' # No longer a default gem as of Ruby 3.4
Bootstrap
bundle install && pnpm install # Install all dependencies
bundle exec jekyll serve # Dev server at localhost:4000
bundle exec jekyll build # Production build to _site/
Phase 1 — Source Acquisition
You need WordPress content in one of two forms:
Option A: Site Mirror (HTML clone)
Mirror the live WordPress site to capture rendered HTML and media:
# HTTrack
httrack "https://your-site.com" -O ./mirror \
--mirror --robots=0 --depth=10
# wget alternative
wget --mirror --convert-links --adjust-extension \
--page-requisites --no-parent https://your-site.com
The mirror gives you the rendered output of every page, including page builder content, plugin output, and all media files.
Option B: WordPress XML Export (WXR)
Export from WP Admin → Tools → Export. This gives you structured content with metadata but no media files and no rendered page builder output.
Recommended: Use Both
The XML export provides metadata (dates, categories, tags). The mirror provides the rendered HTML and the media files. Cross-reference both for the best result.
Phase 2 — Content Extraction
A Python script parses each source file, classifies it by URL pattern, extracts frontmatter, pulls body content, and writes Jekyll-compatible files.
URL-to-Collection Mapping
| WordPress URL Pattern | Jekyll Output | Collection |
|---|---|---|
| `/YYYY/MM/DD/slug/` | `_posts/YYYY-MM-DD-slug.html` | Blog posts |
| `/category/slug/` | Skip (regenerate with an archive plugin such as jekyll-archives if needed) | n/a |
| `/tag/slug/` | Skip (regenerate with an archive plugin such as jekyll-archives if needed) | n/a |
| `/author/slug/` | Skip or `pages/` | n/a |
| `/page-slug/` | `pages/slug.html` | Standalone pages |
| Custom post types | `_collection-name/slug.html` | Custom collections |
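The mapping above can be sketched as a small classifier. This is a minimal sketch, not the guide's actual script: the function name `classify_url`, the `custom_types` parameter, and the choice to skip author archives are assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Date-based permalinks: /YYYY/MM/DD/slug/
POST_RE = re.compile(r"^/(\d{4})/(\d{2})/(\d{2})/([^/]+)/$")
# Archive URLs that are not migrated as content
SKIP_PREFIXES = ("/category/", "/tag/", "/author/")


def classify_url(url, custom_types=()):
    """Return (kind, output_path); output_path is None when the URL is skipped."""
    path = urlparse(url).path
    if not path.endswith("/"):
        path += "/"

    match = POST_RE.match(path)
    if match:
        year, month, day, slug = match.groups()
        return "post", f"_posts/{year}-{month}-{day}-{slug}.html"

    if path.startswith(SKIP_PREFIXES):
        return "skip", None

    parts = [p for p in path.split("/") if p]
    # Custom post types: /<type>/<slug>/ where <type> is a known collection name
    if len(parts) == 2 and parts[0] in custom_types:
        return "collection", f"_{parts[0]}/{parts[1]}.html"
    if len(parts) == 1:
        return "page", f"pages/{parts[0]}.html"
    return "skip", None
```

For example, `classify_url("/portfolio/logo-design/", custom_types=("portfolio",))` yields a path under `_portfolio/`, while the site root and deeper unknown paths fall through to `skip`.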
Frontmatter Extraction
Derive YAML frontmatter from WordPress HTML meta tags or WXR XML:
From HTML mirror (OpenGraph tags):
# Meta tags → frontmatter fields
'og:title' → title
'og:description' → description
'og:image' → image
'og:url' → permalink (og:url is an absolute URL; keep only the path)
'article:published_time' → date
'article:modified_time' → last_modified_at
'article:tag' → tags (multiple)
'article:section' → categories
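A minimal sketch of the OpenGraph mapping, using only the standard library's `html.parser` so it runs without BeautifulSoup. The names `MetaCollector` and `extract_frontmatter` are hypothetical, and the values are stored raw: dates are not reformatted and `og:url` is not yet reduced to a path.

```python
from html.parser import HTMLParser

# Meta property -> frontmatter key, following the mapping above
OG_MAP = {
    "og:title": "title",
    "og:description": "description",
    "og:image": "image",
    "og:url": "permalink",
    "article:published_time": "date",
    "article:modified_time": "last_modified_at",
    "article:tag": "tags",
    "article:section": "categories",
}
MULTI_VALUE = {"tags", "categories"}  # collected into lists


class MetaCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.frontmatter = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = OG_MAP.get(attrs.get("property"))
        value = attrs.get("content")
        if not key or not value:
            return
        if key in MULTI_VALUE:
            self.frontmatter.setdefault(key, []).append(value)
        else:
            # Keep the first occurrence of single-valued fields
            self.frontmatter.setdefault(key, value)


def extract_frontmatter(html):
    collector = MetaCollector()
    collector.feed(html)
    return collector.frontmatter
```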
From WXR XML:
# XML elements → frontmatter fields
'<title>' → title
'<wp:post_date>' → date
'<category domain="category">' → categories
'<category domain="post_tag">' → tags
'<wp:status>' → published (true/false)
'<content:encoded>' → body content
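The WXR mapping can be sketched with `xml.etree.ElementTree`. This assumes the export declares the WXR 1.2 namespace shown below (check the `xmlns:wp` attribute at the top of your export file), and `parse_wxr_items` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions; confirm them against the <rss> element of your export
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}


def parse_wxr_items(xml_text):
    """Yield one dict of frontmatter fields plus body per <item> in a WXR export."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        cats = item.findall("category")
        yield {
            "title": item.findtext("title", default=""),
            "date": item.findtext("wp:post_date", default="", namespaces=NS),
            "categories": [c.text for c in cats if c.get("domain") == "category"],
            "tags": [c.text for c in cats if c.get("domain") == "post_tag"],
            # WordPress marks live posts with the status "publish"
            "published": item.findtext("wp:status", default="", namespaces=NS) == "publish",
            "body": item.findtext("content:encoded", default="", namespaces=NS),
        }
```

Note that a WXR export also contains attachments, menu items, and other post types as `<item>` elements, so a real script would likely filter on `<wp:post_type>` as well.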
Content Body Extraction (from HTML mirror)
Isolate the article body from WordPress page chrome:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# Target the main content div — class varies by theme
content = (
    soup.find("div", class_="post-content") or
    soup.find("div", class_="entry-content") or
    soup.find("article", class_="post") or
    soup.find("main")
)
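Selecting only this container leaves out the navigation, sidebars, related posts, comments, and footer. If none of the selectors match, content is None, so check for that before writing the file.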
Image Handling
- Prefer `data-src` over `src` (WordPress lazy-loading stores the real URL in `data-src`)
- Strip query params: `?resize=800,600&ssl=1` → clean URL
- Rewrite paths: `wp-content/uploads/2024/01/photo.jpg` → `/assets/images/uploads/2024/01/photo.jpg`
- Download all images locally to `assets/images/uploads/YYYY/MM/`
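The first three rules can be sketched with `urllib.parse`. The helper names are hypothetical, and leaving non-upload URLs untouched is an assumption made here so that external images are not broken.

```python
from urllib.parse import urlsplit

UPLOADS_MARKER = "wp-content/uploads/"
LOCAL_PREFIX = "/assets/images/uploads/"


def pick_image_url(attrs):
    """Prefer data-src over src, since lazy-loading puts a placeholder in src."""
    return attrs.get("data-src") or attrs.get("src") or ""


def localize_image_url(url):
    """Drop the query string and map wp-content/uploads/ to the local assets path."""
    path = urlsplit(url).path
    marker_at = path.find(UPLOADS_MARKER)
    if marker_at == -1:
        return url  # not a WordPress upload; left unchanged (assumption)
    return LOCAL_PREFIX + path[marker_at + len(UPLOADS_MARKER):]
```

Downloading the files themselves is left out; the returned path tells you where each downloaded file should be saved relative to the site root.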
Example Output
_posts/2024-01-15-my-blog-post.html
---
layout: post
title: "My Blog Post Title"
date: 2024-01-15
description: "A brief description from the meta tag"
image: /assets/images/uploads/2024/01/featured.jpg
categories: [technology, web-development]
tags: [jekyll, wordpress, migration]
---
<p>The extracted and cleaned article content goes here...</p>
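A file like this can be produced by rendering the frontmatter from a dict and prepending it to the cleaned body. The sketch below is one possible approach, not the guide's script: `render_frontmatter` is a hypothetical helper, and it leans on the fact that JSON double-quoted strings are generally also valid YAML scalars, which keeps titles with colons or quotes safe.

```python
import json


def render_frontmatter(fields):
    """Render a dict as a YAML frontmatter block with JSON-style quoted strings."""
    lines = ["---"]
    for key, value in fields.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            items = ", ".join(json.dumps(str(v), ensure_ascii=False) for v in value)
            rendered = f"[{items}]"
        else:
            rendered = json.dumps(str(value), ensure_ascii=False)
        lines.append(f"{key}: {rendered}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```

Usage: `render_frontmatter({"layout": "post", "title": "My Blog Post Title"}) + body_html` gives the full file content. Every scalar is quoted here, including dates; if you want unquoted values, a YAML library is the safer route.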
Phase 3 — Content Cleanup
WordPress themes inject deeply nested wrapper divs, custom classes, and inline styles. A cleanup script handles these systematically.
Cleanup Pipeline
Run a Python script with BeautifulSoup to clean WordPress artifacts:
- Strip Gutenberg comments: remove `<!-- wp:paragraph -->`, `<!-- /wp:image -->`, etc.

      import re
      content = re.sub(r'<!--\s*/?wp:.*?-->', '', content)

- Unwrap page builder containers: peel nested wrappers from Visual Composer, Elementor, Divi.

      # Classes to unwrap (element replaced by its children)
      unwrap_classes = [
          'wpb_row', 'row-fluid', 'vc_inner', 'vc_column_container',
          'wp-block-image', 'wp-block-embed', 'wp-block-gallery',
          'elementor-widget-container', 'et_pb_module_inner'
      ]
      for cls in unwrap_classes:
          for el in soup.find_all(class_=cls):
              el.unwrap()

- Clean images: strip WordPress-specific attributes, add lazy loading.

      for img in soup.find_all('img'):
          # Use data-src if available (lazy loading plugins)
          if img.get('data-src'):
              img['src'] = img['data-src']
          # Strip WP attributes
          for attr in ['data-src', 'srcset', 'sizes', 'width', 'height',
                       'decoding', 'fetchpriority', 'title']:
              img.attrs.pop(attr, None)
          # Strip WP classes
          if img.get('class'):
              img['class'] = [c for c in img['class']
                              if not c.startswith(('wp-image-', 'aligncenter', 'size-'))]
          img['loading'] = 'lazy'

- Strip inline styles: WordPress content often has inline `style` attributes that override new theme CSS.

      for el in soup.find_all(style=True):
          del el['style']

- Remove empty elements: iteratively delete empty `<p>`, `<div>`, `<span>`, and headings.

      for tag_name in ['p', 'div', 'span', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
          for el in soup.find_all(tag_name):
              if not el.get_text(strip=True) and not el.find(['img', 'iframe', 'video']):
                  el.decompose()

- Remove duplicate headings: if the `<h1>` text matches the frontmatter title, remove it (the Jekyll layout renders it).

- Embed media URLs: convert bare YouTube/Vimeo URLs to responsive iframes.

      # Convert: https://www.youtube.com/watch?v=XXXX
      # To: <div class="aspect-video"><iframe src="https://www.youtube.com/embed/XXXX" ...></iframe></div>
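The last step, embedding media URLs, can be sketched with the standard library alone. The function names, the accepted URL shapes, and the iframe attributes (`loading`, `allowfullscreen`) are assumptions for illustration; the original leaves the iframe attributes unspecified.

```python
import re
from urllib.parse import urlparse, parse_qs

# Wrapper markup; the iframe attributes are illustrative assumptions
EMBED_TEMPLATE = (
    '<div class="aspect-video">'
    '<iframe src="{src}" loading="lazy" allowfullscreen></iframe>'
    "</div>"
)


def embed_src(url):
    """Return an embeddable player URL for a YouTube or Vimeo link, else None."""
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]

    if host == "vimeo.com" and re.fullmatch(r"/\d+", parsed.path):
        return f"https://player.vimeo.com/video/{parsed.path[1:]}"

    if host == "youtube.com" and parsed.path == "/watch":
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    elif host == "youtu.be":
        video_id = parsed.path.lstrip("/")
    else:
        return None
    # Reject anything that does not look like a video ID
    if not re.fullmatch(r"[\w-]{6,}", video_id):
        return None
    return f"https://www.youtube.com/embed/{video_id}"


def embed_bare_url(line):
    """Replace a line that is only a video URL with a responsive iframe wrapper."""
    src = embed_src(line)
    return EMBED_TEMPLATE.format(src=src) if src else line
```

Applying `embed_bare_url` only to paragraphs whose entire text is a URL avoids rewriting ordinary links inside sentences.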
Running Cleanup
python3 scripts/clean.py # Dry-run: prints changes
python3 scripts/clean.py --apply # Apply changes in-place
python3 scripts/clean.py --backup # Backup originals first
python3 scripts/clean.py --file x.html # Process single file
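The flags above suggest a dry-run-by-default command line. A minimal sketch of how `scripts/clean.py` might wire them up with `argparse` follows; the function names, the `.bak` suffix, and the assumption that `--backup` only takes effect together with `--apply` are all illustrative.

```python
import argparse
import shutil
from pathlib import Path


def build_parser():
    parser = argparse.ArgumentParser(description="Clean WordPress markup artifacts")
    parser.add_argument("--apply", action="store_true",
                        help="write changes in place (default: dry run)")
    parser.add_argument("--backup", action="store_true",
                        help="copy each original to <name>.bak before writing")
    parser.add_argument("--file", type=Path,
                        help="process a single file instead of every file")
    return parser


def process(path, clean, apply=False, backup=False):
    """Run clean() on one file; return True when the content changes."""
    original = path.read_text(encoding="utf-8")
    cleaned = clean(original)
    if cleaned == original:
        return False
    if apply:
        if backup:
            shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_text(cleaned, encoding="utf-8")
    else:
        print(f"would change: {path}")  # dry run: report only
    return True
```

Keeping the dry run as the default means a mistyped command never modifies content.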
Phase 4 — Jekyll Architecture
Directory Structure
If you cloned the jekyllwind starter, you already have the base structure. Extend it for your migrated content:
my-jekyll-site/ # git clone https://github.com/koolamusic/jekyllwind
├── _config.yml # Site config, collections, defaults, plugins
├── _posts/ # Blog posts (YYYY-MM-DD-slug.html or .md)
├── _drafts/ # Unpublished posts (add this)
├── _layouts/ # Page templates (from starter)
│ ├── default.html # Base layout: <html>, <head>, nav, footer
│ ├── post.html # Blog post template
│ └── page.html # Generic page template
├── _includes/ # Reusable components (add as needed)
│ ├── header.html
│ ├── footer.html
│ └── post-card.html
├── pages/ # Standalone pages (about, contact, etc.)
├── assets/
│ ├── css/main.css # Tailwind directives (from starter)
│ └── images/uploads/ # Migrated WordPress media (YYYY/MM/)
├── Gemfile # Ruby dependencies (from starter)
├── package.json # Node dependencies (from starter)
├── tailwind.config.js # Tailwind theme config (from starter)
├── postcss.config.js # PostCSS pipeline (from starter)
└── netlify.toml # Deployment config (add this)
Custom Collections
Map WordPress custom post types to Jekyll collections in _config.yml:
# Preserve WordPress URL structure
permalink: /:year/:month/:day/:title/
collections:
  # Example: WordPress 'portfolio' post type → Jekyll collection
  portfolio:
    output: true
    permalink: /portfolio/:title/
  # Example: WordPress 'talks' post type
  talks:
    output: true
    permalink: /talks/:title/
# Set default layouts per collection
defaults:
  - scope: { path: "", type: "posts" }
    values: { layout: "post" }
  - scope: { path: "", type: "portfolio" }
    values: { layout: "portfolio" }
  - scope: { path: "", type: "talks" }
    values: { layout: "portfolio" }
  - scope: { path: "pages" }
    values: { layout: "page" }
Key Config Settings
# _config.yml
title: "Your Site Name"
url: "https://your-site.com"
description: "Site description for SEO"
# Plugins
plugins:
  - jekyll-feed
  - jekyll-sitemap
  - jekyll-seo-tag
  - jekyll-postcss-v2 # Only if using Tailwind
# Feed configuration (jekyll-feed generates an Atom feed)
feed:
  collections:
    - posts
    - portfolio
Phase 5 — Content Format Conversion
HTML to Markdown (Optional)
You can optionally convert extracted HTML content to Markdown for easier editing. Not all content converts cleanly — complex layouts, tables, and embedded media may be better left as HTML.
Good candidates for Markdown conversion:
- Text-heavy blog posts
- Simple pages with headings, paragraphs, lists, links, images
Keep as HTML:
- Posts with complex layouts or embedded widgets
- Content with custom CSS classes you want to preserve
- Pages with embedded iframes, forms, or interactive elements
Conversion approach:
import markdownify
# Convert HTML to Markdown, preserving images and links
markdown_content = markdownify.markdownify(
    html_content,
    heading_style="atx", # Use # style headings
    bullets="-", # Use - for unordered lists
    strip=['script', 'style'] # Remove script and style tags
)
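Before converting, it can help to triage which files are good Markdown candidates. The sketch below flags files containing tags that tend to convert poorly, using only the standard library. The tag set and the helper name `markdown_blockers` are assumptions; tune them to your content.

```python
from html.parser import HTMLParser

# Tags that usually mean "keep as HTML"; this set is an assumption
HTML_ONLY_TAGS = {"iframe", "form", "table", "script", "video", "input", "select"}


class TagScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blockers = set()

    def handle_starttag(self, tag, attrs):
        if tag in HTML_ONLY_TAGS:
            self.blockers.add(tag)


def markdown_blockers(html):
    """Return the set of tags that argue against Markdown conversion."""
    scanner = TagScanner()
    scanner.feed(html)
    return scanner.blockers
```

An empty result suggests a text-heavy post that should convert cleanly; a non-empty one names the elements worth checking by hand.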
Frontmatter Field Mapping
| WordPress Field | Jekyll Frontmatter | Notes |
|---|---|---|
| Post title | `title` | Wrap in quotes if it contains colons |
| Published date | `date` | Format: `YYYY-MM-DD` or `YYYY-MM-DD HH:MM:SS +0000` |
| Slug | `permalink` | Only if overriding the default pattern |
| Categories | `categories` | Array: `[cat1, cat2]` |
| Tags | `tags` | Array: `[tag1, tag2]` |
| Featured image | `image` | Path to local file in `assets/images/` |
| Meta description | `description` | From Yoast/RankMath or `og:description` |
| Author | `author` | String or reference to `_data/authors.yml` |
| Post status: draft | Move to `_drafts/` | Drafts don't need a date prefix in the filename |
| Custom fields | Custom frontmatter keys | Map ACF fields to meaningful frontmatter names |
| Password protected | `protected: true` | Implement client-side gating in layout |
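The draft and date rules in the table can be sketched as a small path helper. `output_path` and its parameters are hypothetical names; the status value `"draft"` matches what WordPress stores in `<wp:status>`.

```python
def output_path(slug, date, status="publish", collection=None):
    """Pick the Jekyll file path for a migrated item.

    date is a 'YYYY-MM-DD...' string; only its first 10 characters are used.
    """
    if collection:
        return f"_{collection}/{slug}.html"
    if status == "draft":
        return f"_drafts/{slug}.html"  # drafts need no date prefix
    return f"_posts/{date[:10]}-{slug}.html"
```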
Phase 6 — Build & Deployment
Local Development
bundle exec jekyll serve --livereload # Dev server with auto-reload
bundle exec jekyll serve --drafts # Include drafts
bundle exec jekyll build # Production build to _site/
Netlify Deployment
# netlify.toml
[build]
command = "bundle exec jekyll build"
publish = "_site"
[build.environment]
JEKYLL_ENV = "production"
RUBY_VERSION = "3.2.0"
NODE_VERSION = "18"
GitHub Pages Deployment
# .github/workflows/jekyll.yml
name: Deploy Jekyll
on:
  push:
    branches: [main]
# Lets GITHUB_TOKEN push the built site to the gh-pages branch
permissions:
  contents: write
jobs:
  build-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ruby/setup-ruby@v1
        with:
          ruby-version: '3.2'
          bundler-cache: true
      - run: bundle exec jekyll build
      - uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./_site
URL Redirects
Preserve WordPress URLs that changed during migration:
# netlify.toml redirects
[[redirects]]
from = "/feed/"
to = "/feed.xml"
status = 301
[[redirects]]
from = "/wp-content/uploads/*"
to = "/assets/images/uploads/:splat"
status = 301
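When many URLs change, the redirect blocks can be generated from the old-to-new mapping collected during extraction. A minimal sketch, assuming that mapping is a plain dict of paths; `netlify_redirects` is a hypothetical helper name.

```python
def netlify_redirects(mapping, status=301):
    """Render {old_path: new_path} as netlify.toml [[redirects]] blocks."""
    blocks = []
    for old, new in mapping.items():
        if old == new:
            continue  # URL unchanged, no redirect needed
        blocks.append(
            "[[redirects]]\n"
            f'  from = "{old}"\n'
            f'  to = "{new}"\n'
            f"  status = {status}\n"
        )
    return "\n".join(blocks)
```

The output can be appended to `netlify.toml` after the `[build]` section.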
For Jekyll-native redirects, use the jekyll-redirect-from gem (add it to the Gemfile and to the plugins list in _config.yml):
# In a post's frontmatter
redirect_from:
- /old-url/
- /another-old-url/
Lessons Learned & Pitfalls
Content Extraction
- WordPress lazy loading, `data-src` vs `src`: lazy-loading plugins store the real image URL in `data-src` and a placeholder in `src`. Always check `data-src` first.
- Image query parameters must be stripped: WordPress appends `?resize=800,600&ssl=1`. These break on static hosting.
- Gutenberg comments have varied syntax: some are self-closing (`<!-- wp:jetpack/slideshow {...} /-->`), some wrap content. Use regex: `<!--\s*/?wp:.*?-->`.
- Visual Composer nesting is extreme: a single image can be wrapped in 5+ layers of divs. Your cleanup script needs multiple unwrapping passes.
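The Gutenberg regex from the list above can be checked against the opening, closing, and self-closing variants. This sketch adds `re.DOTALL`, which is an assumption beyond the original pattern, so that comments whose JSON attributes span several lines are also matched.

```python
import re

# DOTALL lets the pattern cross newlines inside a block comment's JSON attributes
WP_COMMENT = re.compile(r"<!--\s*/?wp:.*?-->", re.DOTALL)


def strip_gutenberg_comments(html):
    """Remove Gutenberg block delimiters while keeping ordinary HTML comments."""
    return WP_COMMENT.sub("", html)
```

Because the pattern requires `wp:` right after the comment opener, unrelated comments are left in place.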
Build & Tooling
- cssnano + csso/css-tree incompatibility: if using PostCSS, do NOT add cssnano. It pulls in csso, which breaks with certain css-tree versions.
- `jekyll-postcss-v2` requires empty frontmatter: your CSS file must start with `---\n---` for Jekyll to process it through PostCSS.
- Tailwind arbitrary `calc()` values fail with spaces: `w-[calc(100%-2rem)]` works; `w-[calc(100% - 2rem)]` does not. Where a space is required, write an underscore, which Tailwind converts to a space.
Design & CSS
- Inline `style` attributes override Tailwind dark mode: WordPress content with `style="color: #333"` overrides `dark:text-white`. Strip all inline styles during cleanup.
- Multiple collections can share a layout: use `_config.yml` defaults to assign the same layout to similar collection types, avoiding duplication.
- Preserve WordPress permalink structure: set `permalink: /:year/:month/:day/:title/` to maintain existing URLs and prevent 404s from external links and search engines.