Table of contents
Introduction
1. Papermill and terminal-notifier
2. cron for Linux/macOS
3. launchd for macOS

Conclusion

[Update.1 : 2021–05–28]

Introduction

Do you have a Data Science project that requires your time every day? Do you use data feeds that update daily? For example, the 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE updates daily and I use it in my personal project.

Manually, I start Jupyter, open a project, restart the kernel, run all the cells, and then git add/commit/push. It is a bit of work. In this article, I am going to share a step-by-step process for setting up launchd and cron jobs for your data science project so that it updates automatically behind the scenes and even notifies you.

cron for Linux/macOS and launchd for macOS

Although launchd is the preferred method in macOS, the cron method still works in macOS as well.

cron is a Unix/Linux utility that schedules a command or script on your server/computer to run automatically at a specified time and date. A cron job is the scheduled task, and it is very useful for automating repetitive tasks.

launchd was created by Apple and is a replacement for a lot of Unix tools, like cron, inetd, init, etc.

You can start scheduling tasks and save a lot of your precious time after reading this article.

Step 0: Papermill and terminal-notifier

Papermill


Papermill is a tool for parameterizing and executing Jupyter Notebooks. I can use this to run a Jupyter Notebook file in a cron and launchd job file.

$ pip install papermill

or

$ pip3 install papermill
$ papermill --help

You can find command-line interface help here.

Papermill's usage:

papermill [OPTIONS] NOTEBOOK_PATH OUTPUT_PATH

I will show you an example soon.

terminal-notifier

terminal-notifier in action. Image by Author

terminal-notifier is a command-line tool to send macOS User Notifications. I will use this to notify me when the scheduled job is done.

Install terminal-notifier.

$ brew install terminal-notifier
$ terminal-notifier -help

cron for Linux/macOS

In macOS, you can run a background job on a timed schedule in two ways: launchd jobs and cron jobs. Note that cron is still supported in macOS v10.15, even though it is not the recommended solution and has been superseded by launchd.

Step 1: Setting up a cron job

You can set up your cron job using your user name:

$ whoami
your-name
$ sudo crontab -u your-name -e
Password:
sh-3.2#

You can switch to the root user with sudo su in macOS so that you are not asked for your password every time.

$ sudo su
$ crontab -u your-name -e

-u specifies the name of the user. -e edits the current crontab.

Syntax

cron syntax guide. Image by Author

Add the five time fields (minute, hour, day of month, month, and day of week, in that order), followed by the path to the file you want to execute.

Example:

0 10 * * * ~/DataScience/covid-19-stats/covid19-cron

The above will run the file ~/DataScience/covid-19-stats/covid19-cron every day at 10:00.
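
To make the five fields concrete, here is a minimal shell sketch that labels each field of a schedule. The function name explain_cron is hypothetical and is not part of cron; it only splits the expression and does not validate the ranges.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of cron): label the five time fields of a schedule.
explain_cron() {
  set -f          # turn off globbing so "*" is not expanded to file names
  set -- $1       # split the expression on whitespace into $1..$5
  set +f
  if [ "$#" -ne 5 ]; then
    echo "expected exactly five fields, got $#" >&2
    return 1
  fi
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

explain_cron "0 10 * * *"
# minute=0 hour=10 day-of-month=* month=* day-of-week=*
```

Only the schedule part is passed in, not the path that follows it in the crontab line.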

If the system is turned off or asleep, cron jobs do not execute. If a designated time is missed, the job runs at the next designated time when your system is on.

You can output stdout and stderr:

# log stdout and stderr
42 6 * * * ~/DataScience/covid-19-stats/covid19-cron > /tmp/stdout.log 2> /tmp/stderr.log

> redirects the standard output to /tmp/stdout.log and 2> redirects the standard error to /tmp/stderr.log.

Once you set up a cron job, you can list it:

$ crontab -l
0 20 * * * ~/DataScience/covid-19-stats/covid19-cron

If you want to remove all the cron jobs:

$ crontab -r

You can add multiple cron jobs in the crontab.

0 20 * * * ~/DataScience/covid-19-stats/covid19-cron
0 7 * * * Path/to/file/to/execute
0 7 * * 0 Path/to/another/file/to/execute

crontab guru is a quick and simple editor for checking cron schedule expressions.

Step 2: Writing a cron job

You can place all cron job files in one directory, but I place mine in the project root. Change the current working directory to your project, create a cron job file, and open it in an editor. According to the Google Shell Style Guide, executables should have no .sh extension.

$ cd path/to/project
$ touch covid19-cron
$ vim covid19-cron

Step 3: Define shebang

The shebang on the first line of a script tells the UNIX/Linux operating system which interpreter to use to run it.

Even though Papermill and terminal-notifier work in a terminal, cron runs jobs with a minimal environment, so we need to use their full paths.

Let's find them.

$ which papermill
/usr/local/bin/papermill
$ which terminal-notifier
/usr/local/bin/terminal-notifier

In my covid19-cron file:

#!/usr/bin/env bash
# run covid-19 files
# git add, commit and push

dir=/Users/shinokada/DataScience/covid-19-stats
papermill=/usr/local/bin/papermill
notifier=/usr/local/bin/terminal-notifier

# stop if the project directory is missing, so git never runs elsewhere
cd "$dir" || exit 1
"$papermill" covid-19-matplotlib.ipynb ./latest/covid-19-matplotlib.ipynb
# more files ...
"$papermill" covid-19-plotly.ipynb ./latest/covid-19-plotly.ipynb
git add .
git commit -m "update"
git push
"$notifier" -title Covid19 -subtitle "Daily Updated" -message "Completed" -open "https://mybinder.org/v2/gh/shinokada/covid-19-stats/master"
now=$(date)
echo "Cron job update completed at $now"

I created a "latest" directory in the project root. Papermill writes its output files to this "latest" directory. Since we are going to use git, make sure that you have .git in the project root.
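
With more notebooks, the repeated papermill lines can be folded into a loop that stops at the first failure, so a broken notebook is not committed. The following is a minimal sketch, not the script used in this project: run_notebooks is a hypothetical function name, and the PAPERMILL variable is an assumption that defaults to the path found with which above.

```shell
#!/usr/bin/env bash
# Assumption: papermill lives at the path reported by `which papermill`.
PAPERMILL="${PAPERMILL:-/usr/local/bin/papermill}"

# Hypothetical helper: execute each notebook and write the result into out_dir.
# Usage: run_notebooks OUT_DIR NOTEBOOK...
run_notebooks() {
  local out_dir nb
  out_dir="$1"
  shift
  mkdir -p "$out_dir" || return 1
  for nb in "$@"; do
    # papermill NOTEBOOK_PATH OUTPUT_PATH; stop at the first notebook that fails
    if ! "$PAPERMILL" "$nb" "$out_dir/$(basename "$nb")"; then
      echo "papermill failed on $nb" >&2
      return 1
    fi
  done
}
```

In the cron file this could be called as run_notebooks ./latest covid-19-matplotlib.ipynb covid-19-plotly.ipynb, with the git commands run only if it succeeds.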

If your notebooks use %run somefile, I suggest you add those files to the cron file as well.

I use the title, subtitle, message and open options for terminal-notifier.

terminal-notifier quick guide

terminal-notifier quick guide. Image by Author

Step 4: Add permission to execute

This bash file needs permission to execute.

$ chmod u+x covid19-cron

chmod sets permissions on files.

chmod user guide. Image by Author
chmod action guide. Image by Author
chmod permission guide. Image by Author

chmod u+x covid19-cron allows the user to execute covid19-cron.

For a file whose permissions are 644 (the usual default for a new file), the above command has the same effect as:

$ chmod 744 covid19-cron
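
If you set up several job files, a small check can confirm that each one is executable before cron tries to run it. This sketch uses only the standard test and chmod commands; the function name ensure_executable is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add the user's execute bit only when it is missing.
ensure_executable() {
  if [ ! -f "$1" ]; then
    echo "no such file: $1" >&2
    return 1
  fi
  # -x is true when the current user may execute the file
  [ -x "$1" ] || chmod u+x "$1"
}
```

For example, ensure_executable covid19-cron leaves the file alone if it is already executable.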

Step 5: Mail

cron sends a job's output and error messages to your local mailbox after the job runs. Let's check if the cron job worked.

$ mail

You need to press enter to read messages, and then q and enter to quit. Use j to see the next lines. Check that the mail reports no errors. In case of errors, you need to attend to the problem.

Step 6: Testing a cron job

To test your cron job, you need to set the crontab time to a few minutes from now. launchd allows us to start a job on demand, but for cron this is the only way to test.

$ sudo crontab -u your-name -e
# change time 
5 20 * * * ~/DataScience/covid-19-stats/covid19-cron
$ crontab -l
5 20 * * * ~/DataScience/covid-19-stats/covid19-cron

When the job finishes, it displays the notification.

terminal-notification. Image by Author
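
Waiting for the clock is the only way to exercise cron itself, but many failures come from cron's sparse environment rather than from its timing. As a rough approximation, you can run the job by hand with an almost empty environment using env -i. The sketch below assumes cron's PATH is /usr/bin:/bin, which is a common default but may differ on your system; run_like_cron is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command line in a near-empty, cron-like environment.
# Assumption: cron's default PATH is /usr/bin:/bin (check your system).
run_like_cron() {
  env -i \
    HOME="$HOME" \
    LOGNAME="${LOGNAME:-$(id -un)}" \
    PATH=/usr/bin:/bin \
    SHELL=/bin/sh \
    /bin/sh -c "$1"
}
```

For example, run_like_cron "$HOME/DataScience/covid-19-stats/covid19-cron" runs the job file this way. If the script fails here but works in your terminal, a missing path is a likely cause.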

launchd for macOS

launchd is a unified, open-source service management framework for starting, stopping and managing daemons, applications, processes, and scripts.

If you schedule a launchd job by setting the StartCalendarInterval key and the computer is asleep when the job should have run, your job will run when the computer wakes up.

However, if the machine is off when the job should have run, the job does not execute until the next designated time occurs.

Step 1: plist file

A plist file is a configuration file for system-wide and per-user daemons/agents. A daemon/agent is a program running in the background without user input. In it you define the name of the job, when to run it, what to run, etc. You store your own plist files in the ~/Library/LaunchAgents directory.

[Update.1] If you don't have ~/Library/LaunchAgents you need to create it.

# check ~/Library if it has LaunchAgents
$ ls ~/Library
# if not create the directory
$ mkdir ~/Library/LaunchAgents

Create a plist file:

$ cd ~/Library/LaunchAgents
$ touch com.shinokada.covid19.plist

In the com.shinokada.covid19.plist:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
    <dict>
        <key>Label</key>
        <string>com.shinokada.covid19</string>
        <key>Program</key>
        <string>/Users/shinokada/DataScience/covid-19-stats/covid19-launchd</string>
        <key>EnvironmentVariables</key>
        <dict>
            <key>PATH</key>
            <string>/bin:/usr/bin:/usr/local/bin</string>
        </dict>
        <key>StandardInPath</key>
        <string>/tmp/covid.stdin</string>
        <key>StandardOutPath</key>
        <string>/tmp/covid.stdout</string>
        <key>StandardErrorPath</key>
        <string>/tmp/covid.stderr</string>
        <key>WorkingDirectory</key>
        <string>/Users/shinokada/DataScience/covid-19-stats</string>
        <key>StartCalendarInterval</key>
        <dict>
            <key>Hour</key>
            <integer>8</integer>
            <key>Minute</key>
            <integer>0</integer>
        </dict>
    </dict>
</plist>

Here I run /Users/shinokada/DataScience/covid-19-stats/covid19-launchd at 8:00 AM every day.
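
If you schedule several jobs, the plist can be generated from a template instead of being copied by hand. The sketch below prints a reduced plist (Label, Program and StartCalendarInterval only) to standard output; make_plist and its arguments are hypothetical, and the values in the usage note are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a minimal launchd plist for a daily job.
# Usage: make_plist LABEL PROGRAM HOUR MINUTE
# Note: the values are inserted as-is, with no XML escaping.
make_plist() {
  if [ "$#" -ne 4 ]; then
    echo "usage: make_plist LABEL PROGRAM HOUR MINUTE" >&2
    return 1
  fi
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
    <dict>
        <key>Label</key>
        <string>$1</string>
        <key>Program</key>
        <string>$2</string>
        <key>StartCalendarInterval</key>
        <dict>
            <key>Hour</key>
            <integer>$3</integer>
            <key>Minute</key>
            <integer>$4</integer>
        </dict>
    </dict>
</plist>
EOF
}
```

For example, make_plist com.example.job /path/to/script 8 0 > ~/Library/LaunchAgents/com.example.job.plist writes a job that runs at 8:00 every day. On macOS, plutil -lint on the resulting file checks its syntax.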

Configuration in plist file quick guide

Configuration in plist file quick guide. More at launchd configuration. Image by Author

Step 2: Creating a bash file

Create a file called covid19-launchd in the project root directory. This is very similar to the above covid19-cron.

#!/usr/bin/env bash
# run covid-19 files 
# git add, commit and push
papermill covid-19-data.ipynb ./latest/covid-19-data.ipynb
papermill multiplot.ipynb ./latest/multiplot.ipynb 
# more files ...
papermill uk-japan.ipynb ./latest/uk-japan.ipynb 
papermill Dropdown-interactive.ipynb ./latest/Dropdown-interactive.ipynb
git add . 
git commit -m "update" 
git push
terminal-notifier -title Covid19 -subtitle "Daily Updated" -message "Completed" -open "https://mybinder.org/v2/gh/shinokada/covid-19-stats/master"
now=$(date)
echo "launchd update completed at $now"

Since we set PATH under EnvironmentVariables in the plist file, we don't need to worry about the absolute paths of Papermill and terminal-notifier.
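
A quick way to confirm that the PATH value in the plist is sufficient is to look up each tool using only that value. This is a sketch under the assumption that the tools are ordinary executables (shell builtins would always be found); check_tools is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that each tool can be found on the given PATH value.
# Usage: check_tools PATH_VALUE TOOL...
check_tools() {
  local search_path tool missing
  search_path="$1"
  missing=0
  shift
  for tool in "$@"; do
    # the subshell keeps the temporary PATH from leaking into the caller
    if ! ( PATH="$search_path"; command -v "$tool" >/dev/null 2>&1 ); then
      echo "not found on $search_path: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For the plist above, that would be check_tools /bin:/usr/bin:/usr/local/bin papermill terminal-notifier git.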

You can test if it works by running bash covid19-launchd.

Step 3: Add permission to execute

This bash file needs permission to execute.

$ chmod u+x covid19-launchd

Step 4: Testing launchd

launchctl controls the macOS launchd process. It has subcommands such as list, start, stop, load, unload, etc.

In my case:

$ launchctl list | grep covid
-  0  com.shinokada.covid19
# test/debug 
$ launchctl start com.shinokada.covid19
# if you need to stop
$ launchctl stop com.shinokada.covid19
# load the job
$ launchctl load ~/Library/LaunchAgents/com.shinokada.covid19.plist
# unload the job
$ launchctl unload ~/Library/LaunchAgents/com.shinokada.covid19.plist
# get help
$ launchctl help

Reloading

launchctl does not have a reload command for reading changes to the config.plist file. Instead, you must unload and then load the plist file anew, e.g.:

$ launchctl unload ~/Library/LaunchAgents/com.shinokada.covid19.plist
$ launchctl load $_

$_, like !$, refers to the last argument of the previous command.

If you make any changes to the script or plist, make sure you unload and load the plist.
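
The unload/load pair can be wrapped in a small function so that a reload is one command. A minimal sketch: reload_agent is a hypothetical name, and the LAUNCHCTL variable is an assumption added here so the command can be swapped out (for example with echo for a dry run).

```shell
#!/usr/bin/env bash
# Assumption: launchctl is on PATH; override LAUNCHCTL=echo for a dry run.
LAUNCHCTL="${LAUNCHCTL:-launchctl}"

# Hypothetical helper: unload and then load a plist so launchd rereads it.
reload_agent() {
  if [ ! -f "$1" ]; then
    echo "missing plist: $1" >&2
    return 1
  fi
  # unload may fail if the job was never loaded; carry on to load regardless
  "$LAUNCHCTL" unload "$1"
  "$LAUNCHCTL" load "$1"
}
```

For example, reload_agent ~/Library/LaunchAgents/com.shinokada.covid19.plist reloads the job defined in this article.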

launchctl quick guide

launchctl has many subcommands and the following diagram shows important ones.

Quick guide for launchctl. Image by Author

Conclusion

Scheduled tasks save you time and are easy to set up. You can set them up not only for your Data Science projects but also for your day-to-day work, such as updating node packages, homebrew formulae, etc. If you save 3 minutes a day, you will save more than 18 hours a year! If you are interested, you can see my sample project here.
