Using Multiple Files in a Python Script

If you have multiple CSV files (or files of any other type) to process, you can avoid manually editing your Python script for each file by automating the process with Python's file-handling and looping capabilities. Here's how:

Iterating Over Files in a Directory

You can use the os or glob module in Python to list all the files in a directory and then iterate over them with a loop. This way, your script automatically handles each file one by one.

Using glob

The glob module is very handy for listing files from a directory with pattern matching. If all your CSV files are in the same directory and follow a naming pattern, you can use it like so:

import glob
import pandas as pd

# Path to the directory where your CSV files are stored
path = 'path/to/csv/files/*.csv'  # The '*' pattern matches all files ending with '.csv'

# Loop over files
for filename in glob.glob(path):
    # Read each CSV file
    df = pd.read_csv(filename)

    # Process DataFrame here
    process(df)  # Replace 'process(df)' with your actual processing function
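
If your goal is one combined table rather than per-file processing, a common variation is to collect the DataFrames and concatenate them. Here is a minimal sketch, assuming the same placeholder path as above; sorted() makes the file order deterministic across runs:

import glob
import pandas as pd

# Same placeholder path as above
path = 'path/to/csv/files/*.csv'

# Read every matching file in a deterministic order, then stack the
# results into one DataFrame; ignore_index=True renumbers the rows
dfs = [pd.read_csv(filename) for filename in sorted(glob.glob(path))]
combined = pd.concat(dfs, ignore_index=True)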

Using os

The os module allows more flexibility in handling directory and file paths. If you need to list files and perhaps filter them on more complex criteria, it can be a good choice:

import os
import pandas as pd

# Directory where your CSV files are stored
directory = 'path/to/csv/files'

# Loop over files in the specified directory
for filename in os.listdir(directory):
    if filename.endswith('.csv'):
        file_path = os.path.join(directory, filename)

        # Read the CSV file
        df = pd.read_csv(file_path)

        # Process DataFrame here
        process(df)  # Replace 'process(df)' with your actual processing function
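
For example, one such criterion could be skipping empty files. Here is a minimal sketch using os.path.getsize; the size check is just a stand-in for whatever filter you actually need (modification time, name pattern, and so on):

import os
import pandas as pd

# Directory where your CSV files are stored
directory = 'path/to/csv/files'

for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)

    # Keep only non-empty CSV files; the size check stands in for any
    # richer filtering criterion you might need
    if filename.endswith('.csv') and os.path.getsize(file_path) > 0:
        df = pd.read_csv(file_path)
        process(df)  # Replace 'process(df)' with your actual processing function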

Automating the Process for Large Datasets

If your datasets are very large and you're concerned about memory usage, you can combine the file iteration with chunk-based processing as shown in the earlier example. Here's a simple way to do it with pandas:

import glob
import pandas as pd

# Path to the directory with your CSV files
path = 'path/to/csv/files/*.csv'

# Define chunk size
chunk_size = 10000

# Loop over each CSV file
for filename in glob.glob(path):
    # Process each file in chunks
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        # Process each chunk here
        process(chunk)  # Your processing function
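
To make the per-chunk step concrete, here is a sketch that keeps running totals across all files and chunks; the column name 'amount' is a hypothetical stand-in for one of your own columns:

import glob
import pandas as pd

path = 'path/to/csv/files/*.csv'
chunk_size = 10000

total_rows = 0
total_amount = 0.0

for filename in glob.glob(path):
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        # Only the current chunk is held in memory at any time
        total_rows += len(chunk)
        total_amount += chunk['amount'].sum()  # 'amount' is a hypothetical column

print(f'{total_rows} rows, total amount: {total_amount}')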

Key Points

  • Automation: Using glob.glob or os.listdir allows you to automatically iterate over files in a directory, eliminating the need for manual file name entries or multiple Python scripts.

  • Memory Management: Combining file iteration with chunk-based processing helps manage memory usage, especially with large datasets.

  • Flexibility: This approach lets you process numerous files systematically and can be adapted to various data formats or processing requirements, as the sketch after this list shows.
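
As a sketch of that flexibility, one way to adapt the loop to several formats is to dispatch on the file extension; the directory path is a placeholder, and pd.read_json is just an illustrative second reader:

from pathlib import Path
import pandas as pd

directory = Path('path/to/data/files')

# Map each file extension to the pandas reader that handles it
readers = {'.csv': pd.read_csv, '.json': pd.read_json}

for file_path in directory.iterdir():
    reader = readers.get(file_path.suffix.lower())
    if reader is not None:
        df = reader(file_path)
        process(df)  # Replace 'process(df)' with your actual processing function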

This method simplifies handling multiple data files in a scalable and efficient manner, making it suitable for batch processing large datasets.