Monday, 05 December 2022

Use Python to load raw text data files

Python  Text Analysis 

Python makes it easy to load raw text data into memory for processing. Here, we provide a code template for your reference. It shows how to list the data files that you want and read their contents into a list.

First, we need to import the modules that we want.

import re
from os import listdir

We need the re (regular expression) module to select files whose names match a specific pattern. We also need the listdir() function from the os (operating system) module to get the names of the entries in a given path.

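Suppose the directory that holds the Python script has a subdirectory called data-dir which contains the raw text data files. We can define the data path as below. Note that a relative path such as ./data-dir is resolved against the current working directory, so this assumes the script is run from its own directory.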

data_dir = './data-dir'

Suppose the data file names follow the Data-yyyy-mm-dd.csv format, e.g. Data-2022-11-09.csv. Then we can use the code snippet below to get all the file names ending with .csv.

file_name_list = [file_name for file_name in listdir(data_dir) if file_name.endswith('.csv')]

The function listdir() returns a list with the names of the entries in the given path ./data-dir. The string method endswith() checks whether the file name ends with .csv.
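One detail worth knowing is that listdir() returns the names in arbitrary order. As a minimal sketch, the listing step can be wrapped in a small helper that sorts the names, which puts date-stamped files like these in chronological order. The helper name list_csv_files and the sample file names are made up for illustration, and the demo runs on a temporary directory rather than a real data-dir.

```python
import tempfile
from os import listdir
from pathlib import Path


def list_csv_files(data_dir):
    """Return the names of the .csv entries in data_dir, sorted by name."""
    return sorted(name for name in listdir(data_dir) if name.endswith('.csv'))


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['Data-2022-11-09.csv', 'Data-2022-10-31.csv', 'notes.txt']:
        Path(tmp_dir, name).write_text('demo', encoding='UTF-8')
    csv_names = list_csv_files(tmp_dir)

print(csv_names)  # ['Data-2022-10-31.csv', 'Data-2022-11-09.csv']
```

Sorting by name only matches date order because the yyyy-mm-dd format is zero-padded.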

With all the necessary file names, we can open the files one by one and load the raw text data into a list called raw_data. If the data files are encoded in UTF-8, we can pass the encoding argument to the open() function to specify that.

raw_data = []
for file_name in file_name_list:
    # the with statement closes each file as soon as it has been read
    with open(data_dir + "/" + file_name, 'r', encoding='UTF-8') as data_file:
        raw_data.append(data_file.read())
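If the loading step is needed in more than one place, it could be wrapped in a function. The sketch below is one possible shape, not the only one: the function name load_text_files and the sample file content are assumptions for illustration. It joins paths with pathlib instead of string concatenation and uses a with statement so each file is closed after reading.

```python
import tempfile
from pathlib import Path


def load_text_files(data_dir, file_names, encoding='UTF-8'):
    """Read each named file under data_dir and return the contents as a list of strings."""
    raw_data = []
    for file_name in file_names:
        # Path(...) / name builds the path with the right separator for the OS.
        with open(Path(data_dir) / file_name, 'r', encoding=encoding) as data_file:
            raw_data.append(data_file.read())
    return raw_data


# Demo on a throwaway directory with one hypothetical data file.
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, 'Data-2022-11-09.csv').write_text('id,text\n1,héllo\n', encoding='UTF-8')
    loaded = load_text_files(tmp_dir, ['Data-2022-11-09.csv'])

print(len(loaded))  # 1
```

The returned list has one string per file, in the same order as file_names.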

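If only specific data files are needed, for example all data files from November 2022, we can use the re module to match the file name pattern. We can call the re.match() function when listing the file names as below.

file_name_list = [file_name for file_name in listdir(data_dir) if re.match(r'^Data-2022-11-[0-9]+\.csv$', file_name)]

The call re.match(r'^Data-2022-11-[0-9]+\.csv$', file_name) checks whether the data file belongs to the chosen month, i.e. November 2022. The pattern is written as a raw string (the r prefix) so that Python passes the backslash in \. to the regular expression engine unchanged; without the prefix, recent Python versions warn about an invalid escape sequence.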

A quick explanation of the regular expression used in this re.match() example:

  • ^ start of the string
  • [0-9]+ one or more digits
  • \. a literal dot (the backslash escapes the dot, which would otherwise match any character)
  • $ end of the string
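To see how these pieces work together, the pattern can be tried on a few file names without touching the disk. This is a small sketch; the constant name NOVEMBER_PATTERN and the sample names are invented for the demonstration.

```python
import re

# Raw string so that the backslash in \. reaches the regex engine unchanged.
NOVEMBER_PATTERN = re.compile(r'^Data-2022-11-[0-9]+\.csv$')

# Hypothetical file names: only the first one should match.
sample_names = [
    'Data-2022-11-09.csv',      # November file, matches
    'Data-2022-10-31.csv',      # wrong month
    'Data-2022-11-09.csv.bak',  # extra suffix, rejected by $
    'Data-2022-11-09xcsv',      # no literal dot, rejected by \.
]

november_names = [name for name in sample_names if NOVEMBER_PATTERN.match(name)]
print(november_names)  # ['Data-2022-11-09.csv']
```

Since [0-9]+ accepts any number of digits, a stricter variant could use [0-9]{2} to require exactly two digits for the day.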

The full source code is provided below. You can copy it and adapt it as needed.

import re
from os import listdir

data_dir = './data-dir'

# select all data files ending with .csv
file_name_list = [file_name for file_name in listdir(data_dir) if file_name.endswith('.csv')]

# if you want to select only specific data files, use this instead
# file_name_list = [file_name for file_name in listdir(data_dir) if re.match(r'^Data-2022-11-[0-9]+\.csv$', file_name)]

# load files into a list
raw_data = []
for file_name in file_name_list:
    # the with statement closes each file as soon as it has been read
    with open(data_dir + "/" + file_name, 'r', encoding='UTF-8') as data_file:
        raw_data.append(data_file.read())

