Introduction
Data is the foundation of any successful business or analytical project, and Comma Separated Values (CSV) files have become a standard format for storing and exchanging data due to their simplicity and versatility. Python, as a popular and powerful programming language, offers several methods and libraries to efficiently read from and write to CSV files. In this article, we will explore the best practices to handle CSV files effectively using Python.
Understanding CSV Files
Before diving into Python's CSV processing capabilities, let's understand the structure of a CSV file. A CSV file consists of records, each containing one or more fields separated by a delimiter (usually a comma or a semicolon). The first row often contains headers that label the fields, making it easier to work with the data. To effectively read and write to CSV files, it is crucial to be familiar with their structure.
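For example, a minimal CSV file with a header row and two records (the same kind of data used in the examples below) might look like this:
Name,Age,Occupation
John Doe,30,Engineer
Jane Smith,25,Designer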
Utilizing the Pandas Library
For more complex data manipulation tasks, using the Pandas library is often more convenient. Pandas is a powerful data manipulation library that offers easy-to-use data structures and functions. Here's how to use Pandas for reading and writing CSV files:
import pandas as pd

# Reading from a CSV file
df = pd.read_csv('data.csv')
print(df)

# Writing to a CSV file
data_to_write = pd.DataFrame({
    'Name': ['John Doe', 'Jane Smith'],
    'Age': [30, 25],
    'Occupation': ['Engineer', 'Designer']
})
data_to_write.to_csv('output.csv', index=False)
Using the Built-in CSV Module
Python provides a built-in csv module that simplifies reading from and writing to CSV files. This module handles various edge cases and ensures that data is correctly formatted during processing. Here's an example of how to use the csv module for reading and writing:
import csv

# Reading from a CSV file
with open('data.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        # Access individual fields using row[index]
        print(row)

# Writing to a CSV file
data_to_write = [
    ['Name', 'Age', 'Occupation'],
    ['John Doe', 30, 'Engineer'],
    ['Jane Smith', 25, 'Designer']
]
with open('output.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data_to_write)
Handling CSV Options
Both the csv module and Pandas offer various options to customize CSV file operations. For instance:
- Specifying custom delimiters (e.g., tab-separated values with '\t').
- Handling missing or null values during reading and writing.
- Selecting specific columns while reading or writing.
- Handling encoding issues (e.g., using 'utf-8' or 'latin-1').
A brief sketch of a few of these options is shown below. Always refer to the official documentation of the csv module or the pandas library to explore these options further.
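As a rough sketch of some of these options (the file name data.tsv and the column names here are just placeholders), reading a tab-separated file with both libraries might look like this:
import csv
import pandas as pd

# Pandas: custom delimiter, selected columns, extra missing-value markers,
# and an explicit encoding (placeholder file and column names)
df = pd.read_csv(
    'data.tsv',
    sep='\t',
    usecols=['Name', 'Age'],
    na_values=['N/A', ''],
    encoding='utf-8',
)
print(df)

# Built-in csv module: the delimiter is passed to csv.reader,
# while the encoding is handled by open()
with open('data.tsv', 'r', encoding='utf-8', newline='') as file:
    reader = csv.reader(file, delimiter='\t')
    for row in reader:
        print(row)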
Dealing with Large CSV Files
Reading and writing large CSV files can be resource-intensive. When dealing with massive datasets, consider the following practices:
- Use the chunksize parameter in Pandas to process the file in smaller chunks.
- Opt for csv.DictReader from the csv module if the file has headers, which allows you to access rows as dictionaries instead of lists (see the sketch below).
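As a quick sketch of both ideas (assuming a data.csv with the Name, Age, Occupation header used in the earlier examples):
import csv
import pandas as pd

# csv.DictReader: each row becomes a dict keyed by the header names
with open('data.csv', 'r', newline='') as file:
    for row in csv.DictReader(file):
        print(row['Name'], row['Age'])

# Pandas chunksize: the file is read as an iterator of DataFrames
# of 1000 rows each, so the whole file never sits in memory at once
for chunk in pd.read_csv('data.csv', chunksize=1000):
    print(len(chunk))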
Performance
To compare the performance of pandas with the csv module, we'll conduct two tests:
- Reading a large CSV file using Pandas' chunksize and processing each chunk.
- Reading the same large CSV file using csv.reader.
We'll use Python's time module to measure the execution time for each approach. Let's run the tests and compare the performance.
First, we need to install the pandas library. I will be using Poetry as my package manager:
$ poetry init -n
$ poetry add pandas
Next, we need to create a big CSV file:
import csv
import random
import string

# Function to generate random data
def generate_random_data():
    return {
        'Name': ''.join(random.choice(string.ascii_letters) for _ in range(10)),
        'Age': random.randint(18, 60),
        'Occupation': random.choice(['Engineer', 'Designer', 'Analyst', 'Manager'])
    }

# Function to generate a large CSV file with random data
def generate_large_csv(file_name, num_rows=10000000):
    with open(file_name, 'w', newline='') as file:
        fieldnames = ['Name', 'Age', 'Occupation']
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for _ in range(num_rows):
            data = generate_random_data()
            writer.writerow(data)

if __name__ == '__main__':
    file_name = 'huge_data.csv'
    generate_large_csv(file_name)
The script will create a file named huge_data.csv with 10 million records.
Let's check the generated file:
# count the number of lines
$ wc -l huge_data.csv
10000001 huge_data.csv
# Read the first 5 lines
$ head -n 5 huge_data.csv
Name,Age,Occupation
XOASRpHqil,44,Engineer
dlTnZfRcSU,36,Manager
TPUoyhFtIC,46,Manager
XNgeUXmDam,26,Designer
Here is how we can compare the execution time between the csv module and pandas:
import pandas as pd
import csv
import time

# Decorator that reports the real (wall-clock) and CPU time of a function
def timing_decorator(func):
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter(), time.process_time()
        result = func(*args, **kwargs)
        end_time = time.perf_counter(), time.process_time()
        print(f" Function: {func.__name__}")
        print(f" Real time: {end_time[0] - start_time[0]:.2f} seconds")
        print(f" CPU time: {end_time[1] - start_time[1]:.2f} seconds")
        return result
    return wrapper

# Placeholder for whatever processing you need to do on each chunk of data
def process_data_chunk(chunk):
    pass

# Reading the large CSV file row by row with the csv module
@timing_decorator
def process_large_csv_with_csv_reader(filename):
    with open(filename, 'r') as file:
        reader = csv.reader(file)
        for row in reader:
            process_data_chunk([row])

# Reading the large CSV file using Pandas' chunksize
@timing_decorator
def process_large_csv_with_pandas(filename, chunk_size=1000):
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        process_data_chunk(chunk)

if __name__ == '__main__':
    file_name = 'huge_data.csv'
    chunk_size = 1000
    process_large_csv_with_pandas(file_name, chunk_size)
    process_large_csv_with_csv_reader(file_name)
Let's run it:
$ python csv_reader.py
Function: process_large_csv_with_pandas
Real time: 5.75 seconds
CPU time: 6.71 seconds
Function: process_large_csv_with_csv_reader
Real time: 2.95 seconds
CPU time: 2.95 seconds
We can see that Python's built-in csv module is much faster than pandas here, taking almost half the time to read the same file.
Memory comparison
We will use the memory-profiler package:
$ poetry add memory-profiler
import pandas as pd
import csv
from memory_profiler import profile

# Placeholder for whatever processing you need to do on each chunk of data
def process_data_chunk(chunk):
    pass

# Reading the large CSV file row by row with the csv module
@profile
def process_large_csv_with_csv_reader(filename):
    with open(filename, 'r') as file:
        reader = csv.reader(file)
        for row in reader:
            process_data_chunk([row])

# Reading the large CSV file using Pandas' chunksize
@profile
def process_large_csv_with_pandas(filename, chunk_size=1000):
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        process_data_chunk(chunk)

if __name__ == '__main__':
    file_name = 'huge_data.csv'
    chunk_size = 1000
    process_large_csv_with_pandas(file_name, chunk_size)
    process_large_csv_with_csv_reader(file_name)
Output
Line # Mem usage Increment Occurrences Line Contents
=============================================================
32 68.3 MiB 68.3 MiB 1 @profile
33 def process_large_csv_with_pandas(filename, chunk_size=1000):
34 255.8 MiB -19646.2 MiB 10001 for chunk in pd.read_csv(filename, chunksize=chunk_size):
35 255.8 MiB -19830.6 MiB 10000 process_data_chunk(chunk)
Filename: /Users/ahmed/projects/csv_reader/csv_reader.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
24 252.7 MiB 252.7 MiB 1 @profile
25 def process_large_csv_with_csv_reader(filename):
26 252.7 MiB 0.0 MiB 2 with open(file_name, 'r') as file:
27 252.7 MiB 0.0 MiB 1 reader = csv.reader(file)
28 252.7 MiB 0.0 MiB 10000002 for row in reader:
29 252.7 MiB 0.0 MiB 10000001 process_data_chunk([row])
The memory profiler results show the memory usage and increments for the two functions that process the large CSV file: process_large_csv_with_pandas, which reads the file in chunks, and process_large_csv_with_csv_reader, which reads it row by row.
Pandas:
- The function starts with a memory usage of 68.3 MiB.
- Memory usage peaks at 255.8 MiB during execution.
- The large negative increments (-19646.2 MiB for the read line and -19830.6 MiB for the processing line) are cumulative over the 10000 loop iterations, indicating that memory is repeatedly allocated and then freed as each chunk DataFrame is created and discarded.
Built-in CSV:
- The function starts with a memory usage of 252.7 MiB (a baseline most likely carried over from the preceding pandas run, since both functions run in the same process).
- The function then processes the data row by row using a loop.
- The total number of occurrences for processing the rows is 10000001 (one row processed per iteration).
- The memory usage remains constant throughout the execution at 252.7 MiB, and there is no significant memory increment.
Conclusion:
- pandas exhibits a noticeable memory overhead during this process. The large negative increments suggest that temporary data structures (the per-chunk DataFrames) are repeatedly created and freed during iteration.
- On the other hand, the built-in csv.reader has a much smaller memory overhead. It reads the file row by row, and there are no noticeable memory increments during its execution.
It's important to note that memory usage behavior may vary depending on the size of the CSV file being processed and the specific operations performed in the process_data_chunk function. However, for handling large CSV files, the built-in csv.reader appears to be the more memory-efficient and less time-consuming approach.