Skip to content
Muhammet Şafak
tr
Languages 5 min read

Writing Data Processing Scripts with Python

The practical case for offloading repetitive transformation tasks to Python, with real script examples.


Every project eventually throws the same task at you: take something in one format, convert it to another. CSV to JSON, Excel to plain text, an API response into a database record. If it’s small, you do it by hand. If it repeats and grows, writing something is inevitable.

When I decided to offload these tasks to Python, I was already deep in a Laravel project and perfectly capable of writing PHP. I could write the same scripts in PHP. So why Python? I now have a battle-tested answer to that question.

Why not PHP?

PHP was designed to respond to web requests. CLI support exists — php artisan commands work great inside Laravel projects. But when all I need to do is read a file, transform it, and write it back out, I don’t want to open composer.json and stand up a class hierarchy. The ceremony is too high for this kind of work.

The same task in Python:

#!/usr/bin/env python3
import csv
import json

def csv_to_json(input_path: str, output_path: str) -> None:
    rows = []
    with open(input_path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            rows.append(dict(row))

    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)

    print(f"{len(rows)} kayıt dönüştürüldü.")

if __name__ == '__main__':
    csv_to_json('input.csv', 'output.json')

Zero external dependencies. csv and json are both in the standard library. Save the file, run python3 convert.py, done.

A real need: customer data cleanup

Last month I got a task like this: a thousand-row Excel file with phone numbers in all sorts of formats (05xx xxx xx xx, +905xxxxxxxxx, 0 5xx-xxx-xxxx) — normalize them all and save as CSV.

import re
import csv
import openpyxl

def normalize_phone(raw: str) -> str | None:
    """Türkiye telefon numarasını 05XXXXXXXXX formatına çevirir."""
    digits = re.sub(r'\D', '', raw)

    if digits.startswith('90') and len(digits) == 12:
        digits = '0' + digits[2:]

    if len(digits) == 10 and digits.startswith('5'):
        digits = '0' + digits

    if len(digits) == 11 and digits.startswith('05'):
        return digits

    return None  # geçersiz


def process_excel(input_path: str, output_path: str) -> None:
    wb = openpyxl.load_workbook(input_path)
    ws = wb.active

    valid = []
    invalid = []

    for row in ws.iter_rows(min_row=2, values_only=True):
        name, phone_raw = row[0], str(row[1] or '')
        normalized = normalize_phone(phone_raw)
        if normalized:
            valid.append({'name': name, 'phone': normalized})
        else:
            invalid.append({'name': name, 'phone': phone_raw})

    with open(output_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'phone'])
        writer.writeheader()
        writer.writerows(valid)

    print(f"Geçerli: {len(valid)}, Geçersiz: {len(invalid)}")


process_excel('musteriler.xlsx', 'temiz.csv')

openpyxl is the only external dependency. I could have written this in PHP too — PCRE instead of re, PhpSpreadsheet instead of openpyxl, and considerably more boilerplate in place of the overall simplicity. Same result, more friction.

One more detail worth mentioning: this script also collects invalid records in a separate list. I had initially skipped that step and only written the valid ones. The client came back with: “I sent you 1,000 rows and only 847 came through — what happened to the other 153?” Writing invalid records to a separate file turned out to be the cleanest way to answer that question.

Automating recurring tasks

Some scripts run once and are forgotten. Others run weekly: download from FTP, transform, load into the database. For these routines, a Python script combined with cron is more than sufficient.

# weekly_import.py
import ftplib
import csv
import psycopg2  # ya da mysql-connector-python

def download_from_ftp(remote_path: str, local_path: str) -> None:
    with ftplib.FTP('ftp.supplier.com') as ftp:
        ftp.login(user='user', passwd='pass')
        with open(local_path, 'wb') as f:
            ftp.retrbinary(f'RETR {remote_path}', f.write)

def import_products(csv_path: str) -> int:
    conn = psycopg2.connect(dsn="...")
    cur = conn.cursor()
    count = 0

    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            cur.execute(
                "INSERT INTO products (sku, name, price) VALUES (%s, %s, %s) "
                "ON CONFLICT (sku) DO UPDATE SET name = EXCLUDED.name, price = EXCLUDED.price",
                (row['SKU'], row['Name'], float(row['Price']))
            )
            count += 1

    conn.commit()
    cur.close()
    conn.close()
    return count

This code lives outside my web application; it has nothing to do with Laravel. An independent tool, an independent process.

What to watch out for as a script grows

Single-file scripts are fine as long as they stay small. But over time, this happens: first one function, then a few functions, then the thought “that other script could use this function too” — and the copy-paste era begins again. At that point, creating a utils.py or a small package structure is healthier. In Python, pip install -e . lets you do an editable install for local development; your scripts can then import from that local package.

There’s also the error handling question. For a one-off script, something that roughly works may be enough. But for a script that runs automatically every week, clear output showing what happened and a meaningful message when something unexpected goes wrong actually matter. Deferring try/except blocks to “later” in production scripts is a decision you’ll regret.

The language selection decision

Here’s the rule I settled on for data processing scripts: if the job requires a specialist library (Excel, SFTP, complex regex), Python. If it needs to integrate tightly with a Laravel project (read one model, create another), a php artisan command. If the scale is small and performance matters, I might even reach for Go.

Keeping that separation clear produces better outcomes than forcing everything into a single language.


Python becoming my go-to for data processing scripts isn’t a language conversion — it’s the maturation of a habit: choosing the right tool. PHP is still my primary language; Python just fills the gap where it fits best.

Tags: #Python
Share:

Comments

Sign in with your GitHub account to join the discussion. Comments are stored in GitHub Discussions.

Related Posts

Search the site

Start typing to search posts, projects and pages.

Esc to close Powered by Pagefind