As Data Scientists, we often land in situations where members of our own team or an adjacent one need to run and test our application on their own. At that point, it's important to make sure the project is flexible enough for someone outside our immediate technical circle to easily handle everything required to run the app.
This is one of the areas where configuration files come in handy. Our machine learning project becomes far more reusable, by anyone within or outside our team, when its settings and parameters can be modified outside the regular code of the app.
A config file helps achieve exactly that. In this article, let's look at an easy way of using one in a simple machine learning workflow, and see how it decouples our parameters and initial settings from the ML code.
Installing the required library
YAML (YAML Ain't Markup Language) provides one of the best ways to write these config files. The PyYAML library is a YAML parser and emitter for Python.
Let's install it in our virtual environment.
pip install pyyaml
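A quick sanity check that the install worked (note that the package installs as pyyaml but imports as yaml):
import yaml
print(yaml.__version__)  # prints the installed PyYAML version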
A simple ML Example
Let's use a simple example of an MNIST Image Classification model here:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

# Loading the MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Reshaping the arrays to 4-dims so they work with the Keras API
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
input_shape = (28, 28, 1)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
# Normalizing the pixel values (MNIST is grayscale, 0-255)
x_train /= 255
x_test /= 255
# Creating a Sequential model
model = Sequential()
model.add(Conv2D(28, kernel_size=(3, 3), input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())  # Flattening the 2D arrays for fully connected layers
model.add(Dense(128, activation=tf.nn.relu))
model.add(Dropout(0.2))
model.add(Dense(10, activation=tf.nn.softmax))
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x=x_train, y=y_train, epochs=10)
Let's look at this short script and see which of its parameters we would like to make configurable:
- The input shape can be defined from outside the code, and if the data came from an external source, we would also define its directory paths outside
- The Conv2D filters and kernel sizes should be configurable
- The MaxPooling2D Layer's pool size should be configurable
- Optimizer, loss function, the list of metrics, and number of epochs should also be configurable
As we can see, plenty can be modified from outside the actual code. As our project grows, so does the number of these configurable parameters, and with it the risk of managing them inefficiently.
Therefore, let's see how to decouple them using YAML config files.
Creating a YAML Config file
Here are some basic rules for you to remember when creating and writing a YAML file:
- They have .yaml as the file extension
- They are written as key-value pairs
- Values can be numerical (integer and floating point), boolean, string (without quotes), or array (see the snippet below)
- They are case-sensitive
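For instance, a tiny snippet covering each of these value types might look like this (the keys here are made up purely for illustration):
learning_rate: 0.001      # floating point
batch_size: 64            # integer
shuffle: true             # boolean
optimizer: adam           # string, no quotes needed
metrics: [accuracy, loss] # array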
Create a new file called config.yaml and let's write down the following parameters there:
height: 28
width: 28
channels: 1
conv2d_filters: 28
conv2d_kernel_size: 3
pool_size: 2
dense_units_1: 128
dropout: 0.2
dense_units_2: 10
optimizer: adam
loss: sparse_categorical_crossentropy
metrics: ['accuracy']
epochs: 10
Now, we can use the PyYAML library in our code to pull these values from the config file when we run the script:
import yaml
# read the yaml config file
with open('./config.yaml') as file:  # use your own path to the config
    config_data = yaml.safe_load(file)
config_data is now a plain Python dictionary, so we can use it as follows:
model.add(Conv2D(config_data["conv2d_filters"],
                 kernel_size=(config_data["conv2d_kernel_size"], config_data["conv2d_kernel_size"]),
                 input_shape=input_shape))
model.add(MaxPooling2D(pool_size=config_data["pool_size"]))
# ... and so on, until:
model.fit(x=x_train, y=y_train, epochs=config_data["epochs"])
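Putting it all together, here's a minimal sketch of the whole model-definition step driven entirely by the config file. It assumes the imports and data preparation from the earlier script, and uses only the keys we defined in config.yaml:
input_shape = (config_data["height"], config_data["width"], config_data["channels"])
model = Sequential()
model.add(Conv2D(config_data["conv2d_filters"],
                 kernel_size=(config_data["conv2d_kernel_size"], config_data["conv2d_kernel_size"]),
                 input_shape=input_shape))
model.add(MaxPooling2D(pool_size=config_data["pool_size"]))
model.add(Flatten())
model.add(Dense(config_data["dense_units_1"], activation=tf.nn.relu))
model.add(Dropout(config_data["dropout"]))
model.add(Dense(config_data["dense_units_2"], activation=tf.nn.softmax))
model.compile(optimizer=config_data["optimizer"],
              loss=config_data["loss"],
              metrics=config_data["metrics"])
model.fit(x=x_train, y=y_train, epochs=config_data["epochs"])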
Now, our project looks much more readable: we no longer need to change any parameters inside the code, since it can all be done from the config file!
In bigger projects, you can even define directory paths and saved model names inside your config, which is especially handy if you're deploying on cloud platforms:
data_directory: ./data/
file_name: my_csv_file.csv
saved_model_path: ./data/model/
saved_model_name: mnist_v2.2.1.h5
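Those entries can then be consumed in code just like the other parameters. A quick illustrative sketch (assuming your data lives in a CSV and you have pandas installed):
import os
import pandas as pd

# Build the data path from the config values
data_path = os.path.join(config_data["data_directory"], config_data["file_name"])
df = pd.read_csv(data_path)

# Save the trained model to the configured location
model.save(os.path.join(config_data["saved_model_path"], config_data["saved_model_name"]))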
Parting words…
So there you have the fundamentals of using config files in your ML apps!
The practice of using config files not only makes our code more human-friendly and reproducible, it also helps with larger-scale ML tracking and hyperparameter search experiments using third-party software such as Weights & Biases.
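For instance, since config_data is just a dictionary, it can be handed straight to an experiment tracker. A minimal sketch, assuming you have wandb installed and logged in (the project name here is hypothetical):
import wandb

# Log this run's hyperparameters straight from our config file
run = wandb.init(project="mnist-config-demo", config=config_data)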
A couple of resources that would be helpful for you to go through:
- The PyYAML docs: https://pyyaml.org/wiki/PyYAMLDocumentation
- The master repository that contains the code for this and all of my other data science articles: https://github.com/yashprakash13/data-another-day