Writing Data Processing Scripts with Python

Every project eventually throws the same task at you: take something in one format, convert it to another. CSV to JSON, Excel to plain text, an API response into a database record. If it’s small, you do it by hand. If it repeats and grows, writing something is inevitable.

When I decided to offload these tasks to Python, I was already deep in a Laravel project and perfectly capable of writing PHP. I could write the same scripts in PHP. So why Python? I now have a battle-tested answer to that question.

Why not PHP?

PHP was designed to respond to web requests. CLI support exists — php artisan commands work great inside Laravel projects. But when all I need to do is read a file, transform it, and write it back out, I don’t want to open composer.json and stand up a class hierarchy. The ceremony is too high for this kind of work.

The same task in Python:

#!/usr/bin/env python3
import csv
import json

def csv_to_json(input_path: str, output_path: str) -> None:
    rows = []
    with open(input_path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            rows.append(dict(row))

    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)

    print(f"{len(rows)} kayıt dönüştürüldü.")

if __name__ == '__main__':
    csv_to_json('input.csv', 'output.json')

Zero external dependencies. csv and json are both in the standard library. Save the file, run python3 convert.py, done.

A real need: customer data cleanup

Last month I got a task like this: a thousand-row Excel file with phone numbers in all sorts of formats (05xx xxx xx xx, +905xxxxxxxxx, 0 5xx-xxx-xxxx) — normalize them all and save as CSV.

import re
import csv
import openpyxl

def normalize_phone(raw: str) -> str | None:
    """Türkiye telefon numarasını 05XXXXXXXXX formatına çevirir."""
    digits = re.sub(r'\D', '', raw)

    if digits.startswith('90') and len(digits) == 12:
        digits = '0' + digits[2:]

    if len(digits) == 10 and digits.startswith('5'):
        digits = '0' + digits

    if len(digits) == 11 and digits.startswith('05'):
        return digits

    return None  # geçersiz


def process_excel(input_path: str, output_path: str) -> None:
    wb = openpyxl.load_workbook(input_path)
    ws = wb.active

    valid = []
    invalid = []

    for row in ws.iter_rows(min_row=2, values_only=True):
        name, phone_raw = row[0], str(row[1] or '')
        normalized = normalize_phone(phone_raw)
        if normalized:
            valid.append({'name': name, 'phone': normalized})
        else:
            invalid.append({'name': name, 'phone': phone_raw})

    with open(output_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'phone'])
        writer.writeheader()
        writer.writerows(valid)

    print(f"Geçerli: {len(valid)}, Geçersiz: {len(invalid)}")


process_excel('musteriler.xlsx', 'temiz.csv')

openpyxl is the only external dependency. I could have written this in PHP too — PCRE instead of re, PhpSpreadsheet instead of openpyxl, and considerably more boilerplate in place of the overall simplicity. Same result, more friction.

One more detail worth mentioning: this script also collects invalid records in a separate list. I had initially skipped that step and only written the valid ones. The client came back with: “I sent you 1,000 rows and only 847 came through — what happened to the other 153?” Writing invalid records to a separate file turned out to be the cleanest way to answer that question.

Automating recurring tasks

Some scripts run once and are forgotten. Others run weekly: download from FTP, transform, load into the database. For these routines, a Python script combined with cron is more than sufficient.

# weekly_import.py
import ftplib
import csv
import psycopg2  # ya da mysql-connector-python

def download_from_ftp(remote_path: str, local_path: str) -> None:
    with ftplib.FTP('ftp.supplier.com') as ftp:
        ftp.login(user='user', passwd='pass')
        with open(local_path, 'wb') as f:
            ftp.retrbinary(f'RETR {remote_path}', f.write)

def import_products(csv_path: str) -> int:
    conn = psycopg2.connect(dsn="...")
    cur = conn.cursor()
    count = 0

    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            cur.execute(
                "INSERT INTO products (sku, name, price) VALUES (%s, %s, %s) "
                "ON CONFLICT (sku) DO UPDATE SET name = EXCLUDED.name, price = EXCLUDED.price",
                (row['SKU'], row['Name'], float(row['Price']))
            )
            count += 1

    conn.commit()
    cur.close()
    conn.close()
    return count

This code lives outside my web application; it has nothing to do with Laravel. An independent tool, an independent process.

What to watch out for as a script grows

Single-file scripts are fine as long as they stay small. But over time, this happens: first one function, then a few functions, then the thought “that other script could use this function too” — and the copy-paste era begins again. At that point, creating a utils.py or a small package structure is healthier. In Python, pip install -e . lets you do an editable install for local development; your scripts can then import from that local package.

There’s also the error handling question. For a one-off script, something that roughly works may be enough. But for a script that runs automatically every week, clear output showing what happened and a meaningful message when something unexpected goes wrong actually matter. Deferring try/except blocks to “later” in production scripts is a decision you’ll regret.

The language selection decision

Here’s the rule I settled on for data processing scripts: if the job requires a specialist library (Excel, SFTP, complex regex), Python. If it needs to integrate tightly with a Laravel project (read one model, create another), a php artisan command. If the scale is small and performance matters, I might even reach for Go.

Keeping that separation clear produces better outcomes than forcing everything into a single language.

Python becoming my go-to for data processing scripts isn’t a language conversion — it’s the maturation of a habit: choosing the right tool. PHP is still my primary language; Python just fills the gap where it fits best.

Writing Data Processing Scripts with Python

Why not PHP?

A real need: customer data cleanup

Automating recurring tasks

What to watch out for as a script grows

The language selection decision

Comments

Related Posts

Automating repetitive tasks with Python

Being Polyglot Is an Outcome, Not a Goal

Four Languages in One Year: The Mental Flexibility That Polyglot Practice Brings