xlsx-processing
XLSX Processing
Overview
Manipulate Excel files programmatically using openpyxl for rich formatting and pandas for data analysis. This skill covers reading/writing spreadsheets, formulas, charts, conditional formatting, data validation, pivot table generation, CSV import/export, and strategies for handling large files.
Apply this skill whenever Excel files need to be created, read, transformed, or enriched through code rather than manual editing.
Multi-Phase Process
Phase 1: Requirements
- Determine operation (read, write, transform, report)
- Identify data sources and volume
- Define formatting and formula requirements
- Plan sheet structure and naming
- Assess performance needs (row count, file size)
STOP — Do NOT begin implementation until you know the row count and whether formatting is needed (this determines library choice).
Phase 2: Implementation
- Select library (see decision table)
- Implement data loading and transformation
- Apply formatting, formulas, and validation
- Add charts and conditional formatting
- Optimize for file size and memory
STOP — Do NOT skip memory optimization for files exceeding 10,000 rows.
Phase 3: Validation
- Open in Excel, LibreOffice, and Google Sheets
- Verify formulas calculate correctly
- Check formatting renders consistently
- Test with edge cases (empty data, max rows)
- Validate data accuracy
Library Decision Table
| Scenario | Library | Why |
|---|---|---|
| Rich formatting (colors, borders, fonts) | openpyxl | Full formatting API |
| Data analysis, aggregation, pivots | pandas | DataFrame operations |
| Formatted report from data analysis | pandas + openpyxl | Combine strengths |
| Reading data only, no formatting needed | pandas | Simplest API |
| Large file (> 10K rows), write-heavy | openpyxl write_only | Streaming writes, low memory |
| Large file (> 10K rows), read-heavy | openpyxl read_only | Streaming reads, low memory |
| CSV to/from Excel conversion | pandas | One-liner operations |
| Charts in spreadsheet | openpyxl | Chart API with full control |
openpyxl Patterns
Creating a Workbook
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter
wb = Workbook()
ws = wb.active
ws.title = "Report"
# Header row
headers = ['Name', 'Department', 'Revenue', 'Target', 'Achievement']
header_font = Font(name='Calibri', size=11, bold=True, color='FFFFFF')
header_fill = PatternFill(start_color='2F5496', end_color='2F5496', fill_type='solid')
header_alignment = Alignment(horizontal='center', vertical='center')
thin_border = Border(
left=Side(style='thin'),
right=Side(style='thin'),
top=Side(style='thin'),
bottom=Side(style='thin'),
)
for col, header in enumerate(headers, 1):
cell = ws.cell(row=1, column=col, value=header)
cell.font = header_font
cell.fill = header_fill
cell.alignment = header_alignment
cell.border = thin_border
# Data rows
for row_idx, row_data in enumerate(data, 2):
for col_idx, value in enumerate(row_data, 1):
cell = ws.cell(row=row_idx, column=col_idx, value=value)
cell.border = thin_border
# Auto-fit column widths
for col in range(1, len(headers) + 1):
max_length = max(
len(str(ws.cell(row=row, column=col).value or ''))
for row in range(1, ws.max_row + 1)
)
ws.column_dimensions[get_column_letter(col)].width = min(max_length + 2, 50)
# Freeze header row
ws.freeze_panes = 'A2'
wb.save('report.xlsx')
Formulas
# Basic formulas
ws['E2'] = '=C2/D2' # Division
ws['F2'] = '=SUM(C2:C100)' # Sum
ws['G2'] = '=AVERAGE(C2:C100)' # Average
ws['H2'] = '=COUNTIF(E2:E100,">1")' # Count if
ws['I2'] = '=IF(E2>=1,"Met","Below")' # Conditional
ws['J2'] = '=VLOOKUP(A2,Sheet2!A:B,2,FALSE)' # Lookup
# Array formula (Excel 365 dynamic array)
ws['K2'] = '=UNIQUE(A2:A100)'
# Named range
from openpyxl.workbook.defined_name import DefinedName
ref = f"Report!$C$2:$C${len(data)+1}"
defn = DefinedName('RevenueRange', attr_text=ref)
wb.defined_names.add(defn)
Charts
from openpyxl.chart import BarChart, LineChart, PieChart, Reference
# Bar chart
chart = BarChart()
chart.type = 'col'
chart.title = 'Revenue by Department'
chart.y_axis.title = 'Revenue ($)'
chart.x_axis.title = 'Department'
chart.style = 10 # Built-in style
data_ref = Reference(ws, min_col=3, min_row=1, max_row=ws.max_row)
cats_ref = Reference(ws, min_col=2, min_row=2, max_row=ws.max_row)
chart.add_data(data_ref, titles_from_data=True)
chart.set_categories(cats_ref)
chart.width = 20
chart.height = 12
ws.add_chart(chart, 'G2')
# Line chart with multiple series
line = LineChart()
line.title = 'Monthly Trends'
for col in range(3, 6):
values = Reference(ws, min_col=col, min_row=1, max_row=13)
line.add_data(values, titles_from_data=True)
cats = Reference(ws, min_col=1, min_row=2, max_row=13)
line.set_categories(cats)
ws.add_chart(line, 'G20')
Conditional Formatting
from openpyxl.formatting.rule import CellIsRule, ColorScaleRule, DataBarRule
# Highlight cells above threshold
ws.conditional_formatting.add(
'C2:C100',
CellIsRule(operator='greaterThan', formula=['100000'],
fill=PatternFill(bgColor='C6EFCE'))
)
# Red for below target
ws.conditional_formatting.add(
'E2:E100',
CellIsRule(operator='lessThan', formula=['1'],
fill=PatternFill(bgColor='FFC7CE'),
font=Font(color='9C0006'))
)
# Color scale (green to red)
ws.conditional_formatting.add(
'E2:E100',
ColorScaleRule(
start_type='min', start_color='F8696B',
mid_type='percentile', mid_value=50, mid_color='FFEB84',
end_type='max', end_color='63BE7B'
)
)
# Data bars
ws.conditional_formatting.add(
'C2:C100',
DataBarRule(start_type='min', end_type='max', color='638EC6')
)
Data Validation
from openpyxl.worksheet.datavalidation import DataValidation
# Dropdown list
dv = DataValidation(type='list', formula1='"Active,Inactive,Pending"', allow_blank=True)
dv.error = 'Please select a valid status'
dv.errorTitle = 'Invalid Entry'
ws.add_data_validation(dv)
dv.add('D2:D100')
# Number range
nv = DataValidation(type='whole', operator='between', formula1=0, formula2=100)
nv.error = 'Value must be between 0 and 100'
ws.add_data_validation(nv)
nv.add('F2:F100')
# Date validation
dv_date = DataValidation(type='date', operator='greaterThan', formula1='2025-01-01')
ws.add_data_validation(dv_date)
dv_date.add('G2:G100')
pandas Patterns
Reading Excel
import pandas as pd
# Read single sheet
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Read with options
df = pd.read_excel('data.xlsx',
sheet_name='Sales',
header=0,
usecols='A:E',
dtype={'ID': str, 'Revenue': float},
parse_dates=['Date'],
na_values=['N/A', 'null', ''],
)
# Read all sheets
sheets = pd.read_excel('data.xlsx', sheet_name=None) # Dict of DataFrames
Writing Excel with pandas + openpyxl
with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:
df_summary.to_excel(writer, sheet_name='Summary', index=False)
df_detail.to_excel(writer, sheet_name='Detail', index=False)
# Access openpyxl workbook for formatting
wb = writer.book
ws = wb['Summary']
# Apply formatting...
Pivot Tables
# Create pivot table
pivot = pd.pivot_table(
df,
values='Revenue',
index='Department',
columns='Quarter',
aggfunc='sum',
margins=True,
margins_name='Total'
)
# Write to Excel with formatting
pivot.to_excel(writer, sheet_name='Pivot')
CSV Import/Export
# CSV to XLSX
df = pd.read_csv('data.csv', encoding='utf-8-sig')
df.to_excel('output.xlsx', index=False)
# XLSX to CSV
df = pd.read_excel('data.xlsx')
df.to_csv('output.csv', index=False, encoding='utf-8-sig')
# Handle encoding issues
df = pd.read_csv('data.csv', encoding='latin-1') # or 'cp1252'
Large File Handling
Memory-Efficient Reading
# openpyxl read-only mode
from openpyxl import load_workbook
wb = load_workbook('large_file.xlsx', read_only=True)
ws = wb.active
for row in ws.iter_rows(min_row=2, values_only=True):
process_row(row)
wb.close()
Chunked Writing
# Write large datasets in chunks
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows
wb = Workbook(write_only=True)
ws = wb.create_sheet()
# Write header
ws.append(headers)
# Write in chunks
chunk_size = 10000
for chunk in pd.read_csv('large.csv', chunksize=chunk_size):
for row in dataframe_to_rows(chunk, index=False, header=False):
ws.append(row)
wb.save('output.xlsx')
Performance Decision Table
| Rows | Strategy | Notes |
|---|---|---|
| < 10,000 | Standard openpyxl or pandas | Full formatting available |
| 10K - 100K | write_only / read_only mode, chunked | Limited formatting in write_only |
| 100K - 1M | write_only mode, consider CSV instead | Near Excel row limit |
| > 1M | Use CSV or Parquet, not XLSX | Excel limit: 1,048,576 rows |
Anti-Patterns / Common Mistakes
| Anti-Pattern | Why It Fails | What To Do Instead |
|---|---|---|
| openpyxl for pure data analysis | Verbose and slow for analytics | Use pandas for data operations |
| Loading large files into memory | Memory exhaustion, crashes | Use read_only / write_only modes |
| Hardcoding row/column numbers | Breaks when data shape changes | Calculate from data length |
| Inconsistent date formats | Dates render as numbers or strings | Set number_format explicitly |
| Not closing read_only workbooks | Resource leaks | Always call wb.close() or use context manager |
| Using .xls format | Legacy, limited, security risks | Always use .xlsx |
| Formatting cells one by one | Extremely slow for large ranges | Apply styles to ranges or use named styles |
| Not testing in actual Excel | Features render differently | Test in Excel, LibreOffice, and Google Sheets |
| Forgetting to freeze header row | Poor UX when scrolling large data | Always freeze panes for data sheets |
Anti-Rationalization Guards
- Do NOT use openpyxl for data analysis that pandas handles in one line.
- Do NOT skip the row count assessment -- it determines your entire approach.
- Do NOT assume standard mode works for files over 10K rows -- use streaming modes.
- Do NOT test only in one spreadsheet application -- formatting varies.
- Do NOT forget to close workbooks opened in read_only mode.
Integration Points
| Skill | How It Connects |
|---|---|
pdf-processing |
Excel data feeds into PDF report generation |
docx-processing |
Excel data populates Word document tables |
email-composer |
Generated spreadsheets attach to professional emails |
file-organizer |
Output file naming and directory structure conventions |
database-schema-design |
Database exports to Excel for reporting |
deployment |
Automated report generation in CI/CD pipelines |
Skill Type
FLEXIBLE — Choose openpyxl for rich formatting and pandas for data analysis. Combine both when you need formatted reports from data analysis. Adapt file handling strategy to data volume.