Table of contents
Introduction
1. Papermill and terminal-notifier
2. cron for Linux/macOS
3. launchd for macOS
Conclusion
[Update.1 : 2021–05–28]
Introduction
Do you have a Data Science project that requires your time every day? Do you use data feeds that update daily? For example, the 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE updates daily and I use it in my personal project.
Manually, I start Jupyter, open a project, restart the kernel and run all the cells, then git add/commit/push. It is a bit of work. In this article, I am going to share a step-by-step process to set up launchd and cron jobs for your data science project so that it automatically updates your project behind the scenes and even notifies you.
cron for Linux/macOS and launchd for macOS
Although launchd is the preferred method in macOS, the cron method still works in macOS as well.
cron is a Linux utility that schedules a command or script on your server/computer to run automatically at a specified time and date. A cron job is the scheduled task, and it is very useful for automating repetitive tasks.
launchd was created by Apple and replaces many Unix tools, such as cron, inetd, init, etc.
You can start scheduling tasks and save a lot of your precious time after reading this article.
Step 0: Papermill and terminal-notifier
Papermill
Papermill is a tool for parameterizing and executing Jupyter Notebooks. I can use this to run a Jupyter Notebook file in a cron and launchd job file.
$ pip install papermill
or
$ pip3 install papermill
$ papermill --help
You can find command-line interface help here.
Papermill's usage:
papermill [OPTIONS] NOTEBOOK_PATH OUTPUT_PATH
I will show you an example soon.
terminal-notifier
terminal-notifier is a command-line tool to send macOS User Notifications. I will use this to notify me when the scheduled job is done.
Install terminal-notifier.
$ brew install terminal-notifier
$ terminal-notifier -help
cron for Linux/macOS
In macOS, you can run a background job on a timed schedule in two ways: launchd jobs and cron jobs. Note that cron is still supported in macOS v10.15, even though it is not the recommended solution and launchd has superseded it.
Step 1: Setting up a cron job
You can set up your cron job using your user name:
$ whoami
your-name
$ sudo crontab -u your-name -e
Password:
sh-3.2#
You can switch to the root user with sudo su in macOS so that you are not asked for your password again.
$ sudo su
$ crontab -u your-name -e
-u specifies the name of the user. -e edits the current crontab.
Syntax
Add the five schedule fields (minute, hour, day of month, month, day of week) followed by the path to the file you want to execute.
Example:
0 10 * * * ~/DataScience/covid-19-stats/covid19-cron
The above will run the file ~/DataScience/covid-19-stats/covid19-cron every day at 10:00.
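To make the five schedule fields concrete, here is a minimal sketch that splits a crontab entry into its parts. The function name describe_cron_line is a hypothetical helper for illustration, not part of cron.

```shell
#!/usr/bin/env bash
# Sketch: split a crontab entry into its five schedule fields and the command.
# describe_cron_line is a hypothetical helper, not part of cron.
describe_cron_line() {
  # read assigns the first five words to the schedule fields;
  # everything left over goes into "command".
  read -r minute hour day_of_month month day_of_week command <<EOF
$1
EOF
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$minute" "$hour" "$day_of_month" "$month" "$day_of_week"
  printf 'command=%s\n' "$command"
}

describe_cron_line '0 10 * * * ~/DataScience/covid-19-stats/covid19-cron'
```

A * in a field means "every value", so the entry above fires at minute 0 of hour 10 on every day of every month.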
If the system is turned off or asleep, cron jobs do not execute. A job that misses its designated time will execute at the next designated time when your system is on.
You can output stdout and stderr:
# log stdout and stderr
42 6 * * * ~/DataScience/covid-19-stats/covid19-cron > /tmp/stdout.log 2> /tmp/stderr.log
> redirects the standard output to /tmp/stdout.log and 2> redirects the standard error to /tmp/stderr.log.
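The redirection can be tried outside of cron first. The following is a small sketch in which demo_job is a hypothetical stand-in for the real cron script and the log files go to a temporary directory rather than /tmp directly.

```shell
#!/usr/bin/env bash
# Sketch: send stdout and stderr of one command to separate log files.
# demo_job is a hypothetical stand-in for the real cron script.
demo_job() {
  echo "notebooks updated"        # goes to stdout
  echo "git push failed" >&2      # goes to stderr
}

log_dir=$(mktemp -d)
demo_job > "$log_dir/stdout.log" 2> "$log_dir/stderr.log"

echo "stdout.log: $(cat "$log_dir/stdout.log")"
echo "stderr.log: $(cat "$log_dir/stderr.log")"
```

If you prefer a single log that keeps growing, `>> /tmp/covid.log 2>&1` appends both streams to one file instead of overwriting two.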
Once you set up a cron job, you can list it:
$ crontab -l
0 20 * * * ~/DataScience/covid-19-stats/covid19-cron
If you want to remove all the cron jobs:
$ crontab -r
You can add multiple cron jobs in the crontab.
0 20 * * * ~/DataScience/covid-19-stats/covid19-cron
0 7 * * * Path/to/file/to/execute
0 7 * * 0 Path/to/another/file/to/execute
crontab guru is a quick and simple tool for checking cron schedule expressions.
Step 2: Writing a cron job
You can place all cron job files in one directory, but I place mine in the project root. Change the current working directory to your project, create a cron job file, and open it in an editor. According to the Google Shell Style Guide, executables should preferably have no extension such as .sh.
$ cd path/to/project
$ touch covid19-cron
$ vim covid19-cron
Step 3: Define shebang
The shebang on the first line of a script tells the UNIX/Linux operating system which interpreter to use to execute it.
Even though Papermill and terminal-notifier work in a terminal, cron runs jobs with a much more limited PATH, so we need to add their full paths.
Let's find them.
$ which papermill
/usr/local/bin/papermill
$ which terminal-notifier
/usr/local/bin/terminal-notifier
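A command that `which` finds in your terminal may not be found under cron, because cron typically starts jobs with a much smaller environment. The sketch below approximates that with `env -i`. The PATH value /usr/bin:/bin is an assumption about a typical cron default, and find_under_cron is a hypothetical helper, so check your own system.

```shell
#!/usr/bin/env bash
# Sketch: look up a command the way a job with a minimal PATH would.
# cron_like_path is an assumption; check your system's cron documentation.
cron_like_path="/usr/bin:/bin"

find_under_cron() {
  # env -i starts from an empty environment, then we set only PATH.
  env -i PATH="$cron_like_path" /bin/sh -c 'command -v "$1"' sh "$1" \
    || echo "$1: not found on $cron_like_path"
}

find_under_cron ls
find_under_cron papermill
```

If papermill lives in /usr/local/bin, as in the output above, the second call reports it as not found, which is why the script uses absolute paths.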
In my covid19-cron file:
#!/usr/bin/env bash
# run covid-19 files
# git add, commit and push
dir=/Users/shinokada/DataScience/covid-19-stats
papermill=/usr/local/bin/papermill
notifier=/usr/local/bin/terminal-notifier
# stop here if the project directory is missing
cd "$dir" || exit 1
$papermill covid-19-matplotlib.ipynb ./latest/covid-19-matplotlib.ipynb
# more files ...
$papermill covid-19-plotly.ipynb ./latest/covid-19-plotly.ipynb
git add .
git commit -m "update"
git push
$notifier -title Covid19 -subtitle "Daily Updated" -message "Completed" -open "https://mybinder.org/v2/gh/shinokada/covid-19-stats/master"
now=$(date)
echo "Cron job update completed at $now"
I created a "latest" directory in the project root. Papermill writes its output files to this "latest" directory. Since we are going to use git, make sure that you have .git in the project root.
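When the list of notebooks grows, the output path under "latest" can be derived from each notebook name instead of being typed twice. This is only a sketch: run_notebook and the DRY_RUN switch are hypothetical names, and with the default DRY_RUN=1 the commands are printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch: derive each output path under latest/ from the notebook name.
# run_notebook and DRY_RUN are hypothetical; DRY_RUN=1 only prints the commands.
papermill=/usr/local/bin/papermill
DRY_RUN=${DRY_RUN:-1}

run_notebook() {
  nb=$1
  out="./latest/$(basename "$nb")"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$papermill $nb $out"
  else
    mkdir -p ./latest          # make sure the output directory exists
    "$papermill" "$nb" "$out"
  fi
}

for nb in covid-19-matplotlib.ipynb covid-19-plotly.ipynb; do
  run_notebook "$nb"
done
```

Setting DRY_RUN=0 runs Papermill for real, so try that only from inside the project directory.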
If your notebooks use %run somefile, I suggest you add those files to the cron file as well.
I use the title, subtitle, message and open options of terminal-notifier.
terminal-notifier quick guide
Step 4: Add permission to execute
This bash file needs permission to execute.
$ chmod u+x covid19-cron
chmod sets permissions on files. chmod u+x covid19-cron allows the user to execute covid19-cron.
For a file that starts with the common default permissions of 644, the above command has the same effect as:
$ chmod 744 covid19-cron
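The two forms differ in how they work: u+x only adds the owner's execute bit to whatever permissions are already there, while 744 sets all nine permission bits at once. The sketch below compares them on a scratch file that starts at 644; the scratch file is created just for this demonstration.

```shell
#!/usr/bin/env bash
# Sketch: compare symbolic (u+x) and octal (744) modes on a scratch file.
scratch=$(mktemp)

chmod 644 "$scratch"          # start from rw-r--r--
chmod u+x "$scratch"          # add execute for the owner only
mode_symbolic=$(ls -l "$scratch" | cut -c1-10)

chmod 644 "$scratch"          # reset
chmod 744 "$scratch"          # set rwxr--r-- in one step
mode_octal=$(ls -l "$scratch" | cut -c1-10)

echo "u+x on 644: $mode_symbolic"
echo "744:        $mode_octal"
rm -f "$scratch"
```

Both lines show the same mode string here. On a file with different starting permissions, the two commands can give different results.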
Step 5: Mail
After running a cron job, cron sends the job's output and error messages by mail to your local mailbox. Let's check if the cron job worked.
$ mail
You need to press enter to read messages, and then q and enter to quit. Use j to see the next lines. Check that the mail contains no errors. In case of errors, you need to fix the problem.
Step 6: Testing a cron job
You need to reset the crontab time to test your cron job. launchd allows us to start a job on demand, but for cron this is the only way to test the scheduled run.
$ sudo crontab -u your-name -e
# change time
5 20 * * * ~/DataScience/covid-19-stats/covid19-cron
$ crontab -l
5 20 * * * ~/DataScience/covid-19-stats/covid19-cron
When the test is done, it will display the notification.
launchd for macOS
launchd is a unified, open-source service management framework for starting, stopping and managing daemons, applications, processes, and scripts.
If you schedule a launchd job by setting the StartCalendarInterval key and the computer is asleep when the job should have run, your job will run when the computer wakes up.
However, if the machine is off when the job should have run, the job does not execute until the next designated time occurs.
Step 1: plist file
A plist file is a system-wide and per-user daemon/agent configuration file. A daemon/agent is a program running in the background without user input. You define the name of the program, when you run it, what you want to run, etc. You store all your plist files in the ~/Library/LaunchAgents directory.
[Update.1] If you don't have ~/Library/LaunchAgents, you need to create it.
# check ~/Library if it has LaunchAgents
$ ls ~/Library
# if not create the directory
$ mkdir ~/Library/LaunchAgents
Create a plist file:
$ cd ~/Library/LaunchAgents
$ touch com.shinokada.covid19.plist
In the com.shinokada.covid19.plist:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.shinokada.covid19</string>
<key>Program</key>
<string>/Users/shinokada/DataScience/covid-19-stats/covid19-launchd</string>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>/bin:/usr/bin:/usr/local/bin</string>
</dict>
<key>StandardInPath</key>
<string>/tmp/covid.stdin</string>
<key>StandardOutPath</key>
<string>/tmp/covid.stdout</string>
<key>StandardErrorPath</key>
<string>/tmp/covid.stderr</string>
<key>WorkingDirectory</key>
<string>/Users/shinokada/DataScience/covid-19-stats</string>
<key>StartCalendarInterval</key>
<dict>
<key>Hour</key>
<integer>8</integer>
<key>Minute</key>
<integer>0</integer>
</dict>
</dict>
</plist>
Here I run /Users/shinokada/DataScience/covid-19-stats/covid19-launchd at 8:00 AM every day.
Configuration in plist file quick guide
Step 2: Creating a bash file
Create a file called covid19-launchd in the project root directory. This is very similar to the above covid19-cron.
#!/usr/bin/env bash
# run covid-19 files
# git add, commit and push
papermill covid-19-data.ipynb ./latest/covid-19-data.ipynb
papermill multiplot.ipynb ./latest/multiplot.ipynb
# more files ...
papermill uk-japan.ipynb ./latest/uk-japan.ipynb
papermill Dropdown-interactive.ipynb ./latest/Dropdown-interactive.ipynb
git add .
git commit -m "update"
git push
terminal-notifier -title Covid19 -subtitle "Daily Updated" -message "Completed" -open "https://mybinder.org/v2/gh/shinokada/covid-19-stats/master"
now=$(date)
echo "launchd update completed at $now"
Since we set PATH under EnvironmentVariables in the plist file, we don't need to worry about absolute paths for Papermill and terminal-notifier.
You can test if it works by running bash covid19-launchd.
Step 3: Add permission to execute
This bash file needs permission to execute.
$ chmod u+x covid19-launchd
Step 4: Testing launchd
launchctl controls the macOS launchd process. It has subcommands such as list, start, stop, load, unload, etc.
In my case:
$ launchctl list | grep covid
- 0 com.shinokada.covid19
# test/debug
$ launchctl start com.shinokada.covid19
# if you need to stop
$ launchctl stop com.shinokada.covid19
# load the job
$ launchctl load ~/Library/LaunchAgents/com.shinokada.covid19.plist
# unload the job
$ launchctl unload ~/Library/LaunchAgents/com.shinokada.covid19.plist
# get help
$ launchctl help
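The three columns printed by launchctl list are the PID, the last exit status, and the label. As a rough sketch, the line from the listing above can be interpreted like this; explain_job_line is a hypothetical helper, and the sample line is copied from the output shown earlier.

```shell
#!/usr/bin/env bash
# Sketch: interpret one line of `launchctl list` output.
# explain_job_line is a hypothetical helper; columns are PID, last exit status, label.
explain_job_line() {
  read -r pid status label <<EOF
$1
EOF
  if [ "$pid" = "-" ]; then
    # A dash in the PID column means the job is loaded but not running now.
    echo "$label is loaded but not running (last exit status: $status)"
  else
    echo "$label is running with PID $pid"
  fi
}

explain_job_line "- 0 com.shinokada.covid19"
```

A non-zero value in the second column after a run is a hint to look at the files set in StandardOutPath and StandardErrorPath.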
Reloading
launchctl does not have a reload command for reading changes to the plist file. Instead, you must unload and then load the plist file anew, e.g.:
$ launchctl unload ~/Library/LaunchAgents/com.shinokada.covid19.plist
$ launchctl load $_
$_, like !$, refers to the last argument of the previous command.
If you make any changes to the script or plist, make sure you unload and load the plist.
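Because the unload and load steps always go together, they can be wrapped in a small function. This is a sketch only: reload_agent and the LAUNCHCTL override are hypothetical conveniences, and the example call uses LAUNCHCTL=echo so that it prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: unload and then load a plist in one step.
# reload_agent and the LAUNCHCTL override are hypothetical conveniences;
# set LAUNCHCTL=echo to print the commands instead of running them.
reload_agent() {
  plist=$1
  "${LAUNCHCTL:-launchctl}" unload "$plist"
  "${LAUNCHCTL:-launchctl}" load "$plist"
}

# Dry run: prints the two launchctl invocations without executing them.
LAUNCHCTL=echo reload_agent "$HOME/Library/LaunchAgents/com.shinokada.covid19.plist"
```

On macOS, calling reload_agent without the LAUNCHCTL=echo prefix runs the real launchctl commands.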
launchctl quick guide
launchctl has many subcommands and the following diagram shows important ones.
Conclusion
Scheduled tasks save you time and are easy to set up. You can set them up not only for your Data Science projects but also for your day-to-day work, such as updating node packages, homebrew formulae, etc. If you save 3 minutes a day, that adds up to more than 18 hours a year! If you are interested, you can see my sample project here.