How To Use Argument Parsing for Greater Efficiency in Machine Learning Workflows

by Thomas A Dorfer | Mar 2023


If you’ve spent some time roaming around in the world of data science or software engineering, you have most likely come across some applications that require you to use your command-line interface, or CLI. Common examples include Azure CLI for managing Azure resources or Git for version control and source code management.

The same type of functionality and program interactivity can be achieved with your own custom Python application. Command-line arguments are a great tool to give your application the flexibility that lets you and your users seamlessly configure and customize the program's behavior.

Arguably the most popular and most frequently used Python library for parsing command-line arguments is argparse, which ships with Python's standard library. In this article, we'll explore some of its core functionalities and, using concrete examples, take a closer look at how to leverage them efficiently in Python applications.

Python’s argparse module offers an intuitive and user-friendly way to parse command-line arguments. In a nutshell, all you need to do is (1) create an ArgumentParser object, (2) add your argument specifications through the add_argument() method, and (3) run the parser with the parse_args() method. Let’s now explore each of these three steps in a bit more detail and see how they can be combined to form a fully functional command-line parser.

First, the ArgumentParser object serves as a container that holds necessary information about the program, such as its name and a brief description. Users can retrieve this information through the help arguments -h or --help, which gives them a better understanding of what the program is intended to do.

import argparse

parser = argparse.ArgumentParser(
    prog='Sample Program',
    description='Description of the sample program'
)

Second, we can add positional or optional arguments through the add_argument() method. Positional arguments are specified by simply providing the argument name, whereas optional arguments need to be identified through a dash prefix: a single dash introduces an abbreviated version of the argument, usually a single letter, and a double dash introduces a more descriptive argument name.

# adding positional argument
parser.add_argument('filename')

# adding optional argument
parser.add_argument('-p', '--parameter')

Finally, we can run the parser using the parse_args() method, which then allows us to access and manipulate the arguments specified in the CLI.

args = parser.parse_args()

# print the parsed arguments
print("filename: " + args.filename + "\n" + "parameter: " + args.parameter)

We can now run this program — let’s name it program.py — in our CLI with some random arguments to see how it works:

>>> python program.py 'sample_data.csv' -p 10
filename: sample_data.csv
parameter: 10
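
A nice bonus: argparse generates a help message for us automatically from the information we provided. Running the program with -h or --help prints something like the following (the exact layout varies slightly between Python versions; older ones label the last section "optional arguments" instead of "options"):

>>> python program.py -h
usage: Sample Program [-h] [-p PARAMETER] filename

Description of the sample program

positional arguments:
  filename

options:
  -h, --help            show this help message and exit
  -p PARAMETER, --parameter PARAMETER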

We have now built functionality that allows us to specify input arguments directly on the command line and use them to perform whatever operations we like. You can probably imagine how useful this is for any development process that involves running a program repeatedly and therefore benefits from an easy, seamless way of interacting with it.

Let’s assume you have built a machine learning or deep learning model in Python and you would like to run it using different hyperparameters, such as learning rate, batch size, or number of epochs, and store the results in different directories.

Specifying these hyperparameters directly on the command line considerably simplifies the way you interact with that program. It enables you to experiment with different model configurations without having to modify the underlying source code, which also reduces the likelihood of introducing unintended bugs.

Example: Train a Random Forest Classifier

Imagine you want to build an experimentation workflow that allows you to seamlessly and repeatedly train a random forest classifier. You want to configure it in such a way that you can simply pass the training dataset, some hyperparameters, and the model's target directory via the CLI, and it will train the model and store it in the specified location for you.

For this example, we’ll use the publicly available Iris Species dataset. We can load the dataset through seaborn and save it as iris.csv.

import seaborn as sns
iris = sns.load_dataset("iris")
iris.to_csv('iris.csv', index=False)

To get a better idea of what our data looks like, we can visualize it with a pair plot:

Pair plot of the Iris dataset. Image by the Author. Dataset used: Iris Species. License: CC0 Public Domain.
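
By the way, a pair plot like this takes only a couple of lines with seaborn; here's a minimal sketch (coloring by species is a choice I'm assuming here, to make the classes easy to distinguish):

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")

# scatter plots of every pairwise feature combination, colored by species
sns.pairplot(iris, hue="species")
plt.show()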

Now on to our main task: building parser functionality into our Python program. To train the random forest classifier, we'd like to pass it the training dataset (that will be our positional argument), two hyperparameters, and a target path where our model will be stored. The latter three will be our optional arguments.

For the optional arguments, we will also specify the type through the type keyword, the default values through the default keyword, and a helpful description of the argument through the help keyword.

Then, we will parse the arguments and store the results in the variable args, which we will later use to specify the dataset that we’re reading in, the hyperparameters to train the classifier, and the location where we’d like the model to be saved.

Here’s how this looks in code:

import argparse
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier

# Define and parse command-line arguments
parser = argparse.ArgumentParser(
    prog='Model training',
    description='Train a random forest classifier on the iris dataset'
)
parser.add_argument(
    'train_data', help='training data (.csv format)'
)
parser.add_argument(
    '--n_estimators', type=int, default=100,
    help='number of trees in the forest (default: 100)'
)
parser.add_argument(
    '--max_depth', type=int, default=None,
    help='maximum depth of the tree (default: None)'
)
parser.add_argument(
    '--model_path', type=str, default='./model.pkl',
    help='path to save the trained model (default: ./model.pkl)'
)
args = parser.parse_args()

# Read the dataset
iris = pd.read_csv(args.train_data)
X = iris.loc[:, iris.columns != 'species']
y = iris['species']

# Train a random forest classifier with the specified hyperparameters
clf = RandomForestClassifier(
    n_estimators=args.n_estimators,
    max_depth=args.max_depth,
    random_state=42
)
clf.fit(X, y)

# Save the trained model to a pickle file
with open(args.model_path, 'wb') as f:
    pickle.dump(clf, f)

Now, let’s save that script as train_rf.py and place it in the same directory as our training dataset, iris.csv.

Next, we open up a terminal window from which we can call this program with custom-defined arguments. In the example below, we specify n_estimators to be 100 and max_depth to be 10. Regarding model_path, we’re happy with the default path and don’t need to specify it in this case.

>>> python .\train_rf.py 'iris.csv' --n_estimators 100 --max_depth 10 

This line will train our random forest classifier and, after a short while, you’ll see a file named model.pkl appear in your directory, which you can then use to validate your model on a test set or to produce predictions.
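
If you'd like to sanity-check the result right away, here is a minimal sketch of how you might load the pickled model and produce a prediction (the input values below are made up for illustration):

import pickle
import pandas as pd

# Load the trained classifier from disk
with open('model.pkl', 'rb') as f:
    clf = pickle.load(f)

# A single hypothetical flower measurement, using the same feature columns as in training
sample = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2]],
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
)
print(clf.predict(sample))  # e.g. ['setosa']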
