Introduction

Data is the foundation of any successful business or analytical project, and Comma Separated Values (CSV) files have become a standard format for storing and exchanging data due to their simplicity and versatility. Python, as a popular and powerful programming language, offers several methods and libraries to efficiently read from and write to CSV files. In this article, we will explore the best practices to handle CSV files effectively using Python.

Understanding CSV Files

Before diving into Python's CSV processing capabilities, let's understand the structure of a CSV file. A CSV file consists of records, each containing one or more fields separated by a delimiter (usually a comma or a semicolon). The first row often contains headers that label the fields, making it easier to work with the data. To effectively read and write to CSV files, it is crucial to be familiar with their structure.
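
For example, a tiny CSV file with a header row and two records could look like this (the values are just illustrative and reappear in the examples below):

Name,Age,Occupation
John Doe,30,Engineer
Jane Smith,25,Designer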

Utilizing the Pandas Library

For more complex data manipulation tasks, the Pandas library is often more convenient than the built-in csv module covered in the next section. Pandas offers powerful, easy-to-use data structures and functions for working with tabular data. Here's how to use Pandas for reading and writing CSV files:

import pandas as pd

# Reading from a CSV file
df = pd.read_csv('data.csv')
print(df)

# Writing to a CSV file
data_to_write = pd.DataFrame({
    'Name': ['John Doe', 'Jane Smith'],
    'Age': [30, 25],
    'Occupation': ['Engineer', 'Designer']
})

data_to_write.to_csv('output.csv', index=False)

Using the Built-in CSV Module

Python provides a built-in csv module that simplifies reading from and writing to CSV files. This module handles various edge cases and ensures that data is correctly formatted during processing. Here's an example of how to use the csv module for reading and writing:

import csv

# Reading from a CSV file
with open('data.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        # Access individual fields using row[index]
        print(row)

# Writing to a CSV file
data_to_write = [
    ['Name', 'Age', 'Occupation'],
    ['John Doe', 30, 'Engineer'],
    ['Jane Smith', 25, 'Designer']
]

with open('output.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data_to_write)

Handling CSV Options

Both the csv module and Pandas offer various options to customize CSV file operations. For instance:

  • Specifying custom delimiters (e.g., tab-separated values with '\t').
  • Handling missing or null values during reading and writing.
  • Selecting specific columns while reading or writing.
  • Handling encoding issues (e.g., using 'utf-8' or 'latin-1').

Always refer to the official documentation of the csv module or the pandas library to explore these options further; a few of them are sketched below.
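
As a rough sketch of a few of these options (the file names and column choices here are placeholders, not files used elsewhere in this article), a custom delimiter, column selection, and missing-value handling might look like this:

import csv
import pandas as pd

# csv module: read a tab-separated file by overriding the default delimiter
with open('data.tsv', 'r', newline='', encoding='utf-8') as file:
    reader = csv.reader(file, delimiter='\t')
    for row in reader:
        print(row)

# pandas: read only two columns and treat 'N/A' as a missing value
df = pd.read_csv('data.csv', usecols=['Name', 'Age'], na_values=['N/A'], encoding='utf-8')

# pandas: write with a semicolon delimiter and an empty string for missing values
df.to_csv('output.csv', sep=';', na_rep='', index=False)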

Dealing with Large CSV Files

Reading and writing large CSV files can be resource-intensive. When dealing with massive datasets, consider the following practices:

  • Use the chunksize parameter in Pandas to process the file in smaller chunks.
  • If the file has headers, opt for csv.DictReader from the csv module, which lets you access rows as dictionaries instead of lists (both approaches are sketched below).
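
Here is a minimal sketch of both approaches, assuming a large file with the same Name/Age/Occupation header used in the earlier examples:

import csv
import pandas as pd

# pandas: process the file in chunks of 100,000 rows instead of loading it all at once
for chunk in pd.read_csv('large_data.csv', chunksize=100_000):
    # Each chunk is an ordinary DataFrame; filter or aggregate it here
    print(len(chunk))

# csv module: stream the file row by row; DictReader maps each row to the header fields
with open('large_data.csv', 'r', newline='') as file:
    for row in csv.DictReader(file):
        # Each row behaves like a dict, e.g. row['Name'] or row['Age']
        pass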

Performance

To compare the performance of pandas with the built-in csv module, we'll conduct two tests:

  1. Reading a large CSV file using Pandas' chunksize and processing each chunk.
  2. Reading the same large CSV file using the csv module.

We'll use Python's time module to measure the execution time for each approach. Let's run the tests and compare the performance:

First, we need to install the pandas library. I will be using Poetry as my package manager:

$ poetry init -n
$ poetry add pandas

Next, we need to generate a large CSV file:

import csv
import random
import string

# Function to generate random data
def generate_random_data():
    return {
        'Name': ''.join(random.choice(string.ascii_letters) for _ in range(10)),
        'Age': random.randint(18, 60),
        'Occupation': random.choice(['Engineer', 'Designer', 'Analyst', 'Manager'])
    }

# Function to generate a large CSV file with random data
def generate_large_csv(file_name, num_rows=10000000):
    with open(file_name, 'w', newline='') as file:
        fieldnames = ['Name', 'Age', 'Occupation']
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()

        for _ in range(num_rows):
            data = generate_random_data()
            writer.writerow(data)

if __name__ == '__main__':
    file_name = 'huge_data.csv'
    generate_large_csv(file_name)

The script creates a file named huge_data.csv with 10 million records.

Let's check the generated file:

# count the number of lines
$ wc -l huge_data.csv
10000001 huge_data.csv

# Read the first 5 lines
$ head -n 5 huge_data.csv
Name,Age,Occupation
XOASRpHqil,44,Engineer
dlTnZfRcSU,36,Manager
TPUoyhFtIC,46,Manager
XNgeUXmDam,26,Designer

Here is how we can compare the execution time of csv and pandas:

import pandas as pd
import csv
import time

def timing_decorator(func):
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter(), time.process_time()
        result = func(*args, **kwargs)
        end_time = time.perf_counter(), time.process_time()
        print(f" Function: {func.__name__}")
        print(f" Real time: {end_time[0] - start_time[0]:.2f} seconds")
        print(f" CPU time: {end_time[1] - start_time[1]:.2f} seconds")
        return result
    return wrapper

# Placeholder for whatever per-chunk (or per-row) processing you need
def process_data_chunk(chunk):
    pass

@timing_decorator
def process_large_csv_with_csv_reader(filename):
    with open(filename, 'r', newline='') as file:
        reader = csv.reader(file)
        for row in reader:
            process_data_chunk([row])

@timing_decorator
def process_large_csv_with_pandas(filename, chunk_size=1000):
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        process_data_chunk(chunk)

if __name__ == '__main__':
    file_name = 'huge_data.csv'
    chunk_size = 1000

    process_large_csv_with_pandas(file_name, chunk_size)
    process_large_csv_with_csv_reader(file_name)

Let's run it:

 $ python csv_reader.py
 Function: process_large_csv_with_pandas
 Real time: 5.75 seconds
 CPU time: 6.71 seconds

 Function: process_large_csv_with_csv_reader
 Real time: 2.95 seconds
 CPU time: 2.95 seconds

We can see that Python's built-in csv module is considerably faster than pandas here, taking roughly half the time to read the same file.

Memory comparison

To compare memory usage, we will use the memory-profiler package:

$ poetry add memory-profiler

Then we decorate both functions with @profile from memory_profiler:
import pandas as pd
import csv
import time
from memory_profiler import profile

# Placeholder for whatever per-chunk (or per-row) processing you need
def process_data_chunk(chunk):
    pass

@profile
def process_large_csv_with_csv_reader(filename):
    with open(filename, 'r', newline='') as file:
        reader = csv.reader(file)
        for row in reader:
            process_data_chunk([row])

@profile
def process_large_csv_with_pandas(filename, chunk_size=1000):
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        process_data_chunk(chunk)

if __name__ == '__main__':
    file_name = 'huge_data.csv'
    chunk_size = 1000

    process_large_csv_with_pandas(file_name, chunk_size)
    process_large_csv_with_csv_reader(file_name)

Output

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    32     68.3 MiB     68.3 MiB           1   @profile
    33                                         def process_large_csv_with_pandas(filename, chunk_size=1000):
    34    255.8 MiB -19646.2 MiB       10001       for chunk in pd.read_csv(filename, chunksize=chunk_size):
    35    255.8 MiB -19830.6 MiB       10000           process_data_chunk(chunk)


Filename: /Users/ahmed/projects/csv_reader/csv_reader.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    24    252.7 MiB    252.7 MiB           1   @profile
    25                                         def process_large_csv_with_csv_reader(filename):
    26    252.7 MiB      0.0 MiB           2       with open(file_name, 'r') as file:
    27    252.7 MiB      0.0 MiB           1           reader = csv.reader(file)
    28    252.7 MiB      0.0 MiB    10000002           for row in reader:
    29    252.7 MiB      0.0 MiB    10000001               process_data_chunk([row])

The memory profiler results show the memory usage and increments for the two functions, process_large_csv_with_pandas and process_large_csv_with_csv_reader, as each reads and processes the same large CSV file.

Pandas:

  • The function starts with a memory usage of 68.3 MiB.
  • The function's memory usage reaches a peak of 255.8 MiB at some point during the execution.
  • The increment column shows large negative totals (-19646.2 MiB and -19830.6 MiB) accumulated over the 10,000 iterations. Memory usage itself never goes negative; these figures reflect memory being repeatedly allocated for each chunk and freed again once the chunk is discarded.

Built-in CSV:

  • The function starts with a memory usage of 252.7 MiB, the baseline it inherits because the pandas test runs first in the same process.
  • The function then processes the data row by row using a loop.
  • The total number of occurrences for processing the rows is 10000001 (one row processed per iteration).
  • The memory usage remains constant throughout the execution at 252.7 MiB, and there is no significant memory increment.

Conclusion:

  • pandas seems to exhibit a noticeable memory overhead during this process: usage climbs from 68.3 MiB to about 255.8 MiB, and the large negative increments suggest that the temporary per-chunk data structures are created and freed repeatedly as the file is processed.
  • On the other hand, the built-in csv.reader seems to have a much smaller memory overhead. It reads the file row by row, and there are no noticeable memory increments during its execution.

It's important to note that the memory usage behavior may vary depending on the size of the CSV file being processed and the specific data processing operations performed in the process_data_chunk function. However, in these tests the built-in csv.reader is the more memory-efficient and less time-consuming approach for handling large CSV files.