Writing Data Processing Scripts with Python
The practical case for offloading repetitive transformation tasks to Python, with real script examples.
Every project eventually throws the same task at you: take something in one format, convert it to another. CSV to JSON, Excel to plain text, an API response into a database record. If it’s small, you do it by hand. If it repeats and grows, writing something is inevitable.
When I decided to offload these tasks to Python, I was already deep in a Laravel project and perfectly capable of writing PHP. I could write the same scripts in PHP. So why Python? I now have a battle-tested answer to that question.
Why not PHP?
PHP was designed to respond to web requests. CLI support exists — php artisan commands work great inside Laravel projects. But when all I need to do is read a file, transform it, and write it back out, I don’t want to open composer.json and stand up a class hierarchy. The ceremony is too high for this kind of work.
The same task in Python:
#!/usr/bin/env python3
import csv
import json
def csv_to_json(input_path: str, output_path: str) -> None:
rows = []
with open(input_path, newline='', encoding='utf-8') as f:
reader = csv.DictReader(f)
for row in reader:
rows.append(dict(row))
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(rows, f, ensure_ascii=False, indent=2)
print(f"{len(rows)} kayıt dönüştürüldü.")
if __name__ == '__main__':
csv_to_json('input.csv', 'output.json')
Zero external dependencies. csv and json are both in the standard library. Save the file, run python3 convert.py, done.
A real need: customer data cleanup
Last month I got a task like this: a thousand-row Excel file with phone numbers in all sorts of formats (05xx xxx xx xx, +905xxxxxxxxx, 0 5xx-xxx-xxxx) — normalize them all and save as CSV.
import re
import csv
import openpyxl
def normalize_phone(raw: str) -> str | None:
"""Türkiye telefon numarasını 05XXXXXXXXX formatına çevirir."""
digits = re.sub(r'\D', '', raw)
if digits.startswith('90') and len(digits) == 12:
digits = '0' + digits[2:]
if len(digits) == 10 and digits.startswith('5'):
digits = '0' + digits
if len(digits) == 11 and digits.startswith('05'):
return digits
return None # geçersiz
def process_excel(input_path: str, output_path: str) -> None:
wb = openpyxl.load_workbook(input_path)
ws = wb.active
valid = []
invalid = []
for row in ws.iter_rows(min_row=2, values_only=True):
name, phone_raw = row[0], str(row[1] or '')
normalized = normalize_phone(phone_raw)
if normalized:
valid.append({'name': name, 'phone': normalized})
else:
invalid.append({'name': name, 'phone': phone_raw})
with open(output_path, 'w', newline='', encoding='utf-8') as f:
writer = csv.DictWriter(f, fieldnames=['name', 'phone'])
writer.writeheader()
writer.writerows(valid)
print(f"Geçerli: {len(valid)}, Geçersiz: {len(invalid)}")
process_excel('musteriler.xlsx', 'temiz.csv')
openpyxl is the only external dependency. I could have written this in PHP too — PCRE instead of re, PhpSpreadsheet instead of openpyxl, and considerably more boilerplate in place of the overall simplicity. Same result, more friction.
One more detail worth mentioning: this script also collects invalid records in a separate list. I had initially skipped that step and only written the valid ones. The client came back with: “I sent you 1,000 rows and only 847 came through — what happened to the other 153?” Writing invalid records to a separate file turned out to be the cleanest way to answer that question.
Automating recurring tasks
Some scripts run once and are forgotten. Others run weekly: download from FTP, transform, load into the database. For these routines, a Python script combined with cron is more than sufficient.
# weekly_import.py
import ftplib
import csv
import psycopg2 # ya da mysql-connector-python
def download_from_ftp(remote_path: str, local_path: str) -> None:
with ftplib.FTP('ftp.supplier.com') as ftp:
ftp.login(user='user', passwd='pass')
with open(local_path, 'wb') as f:
ftp.retrbinary(f'RETR {remote_path}', f.write)
def import_products(csv_path: str) -> int:
conn = psycopg2.connect(dsn="...")
cur = conn.cursor()
count = 0
with open(csv_path, newline='', encoding='utf-8') as f:
for row in csv.DictReader(f):
cur.execute(
"INSERT INTO products (sku, name, price) VALUES (%s, %s, %s) "
"ON CONFLICT (sku) DO UPDATE SET name = EXCLUDED.name, price = EXCLUDED.price",
(row['SKU'], row['Name'], float(row['Price']))
)
count += 1
conn.commit()
cur.close()
conn.close()
return count
This code lives outside my web application; it has nothing to do with Laravel. An independent tool, an independent process.
What to watch out for as a script grows
Single-file scripts are fine as long as they stay small. But over time, this happens: first one function, then a few functions, then the thought “that other script could use this function too” — and the copy-paste era begins again. At that point, creating a utils.py or a small package structure is healthier. In Python, pip install -e . lets you do an editable install for local development; your scripts can then import from that local package.
There’s also the error handling question. For a one-off script, something that roughly works may be enough. But for a script that runs automatically every week, clear output showing what happened and a meaningful message when something unexpected goes wrong actually matter. Deferring try/except blocks to “later” in production scripts is a decision you’ll regret.
The language selection decision
Here’s the rule I settled on for data processing scripts: if the job requires a specialist library (Excel, SFTP, complex regex), Python. If it needs to integrate tightly with a Laravel project (read one model, create another), a php artisan command. If the scale is small and performance matters, I might even reach for Go.
Keeping that separation clear produces better outcomes than forcing everything into a single language.
Python becoming my go-to for data processing scripts isn’t a language conversion — it’s the maturation of a habit: choosing the right tool. PHP is still my primary language; Python just fills the gap where it fits best.
Comments
Sign in with your GitHub account to join the discussion. Comments are stored in GitHub Discussions.