Getting Started on Cloud Computing

What is Cloud Computing?

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What is cloud computing and how can I use it?

  • What are common cloud services available?

Objectives
  • Learn some historic developments that led to cloud computing

  • Understand the main drivers of cloud computing

  • Know typical use cases of cloud computing relevant for your work

In the early days of computing, computers were called mainframes and filled a big room. To let them perform computations, you had to feed them punch cards, so most of the time their users were busy sitting behind a desk punching holes in pieces of paper. When they were done, they would leave the punch cards in a mailbox and, after some time, receive the results of the computation (if no error occurred) printed on paper from another mailbox. Since these computers were very large and expensive, many different users shared the same machine.

IBM 704 mainframe (1964)
IBM 704 mainframe (1964). Courtesy of Lawrence Livermore National Laboratory

For the users of these computers, the computer terminal with a screen and keyboard was a very welcome innovation. As computers were still expensive, multiple terminals were connected to the same mainframe. All data was stored and computations happened on the mainframe, with the terminals only serving as endpoints providing a keyboard instead of punch cards and a screen instead of text printed on paper. In modern-day computing, we still see the word terminal appear, but usually as an application that can be used to interact with a computer system.

DEC VT100 terminal (1978)
DEC VT100 terminal (1978). Courtesy of Jason Scott (CC BY 2.0)

By the 1980s, computers got small enough to sit on a desk, and personal computers emerged: machines not shared by many users but dedicated to a single person or family. For some decades, this was a common way people used computers. Even though big companies still used centralized mainframes for important databases, networking was still expensive for most people, so most computers were used in a stand-alone way and data was most commonly transferred between computers using physical data carriers.

40 Years of Removable Storage
40 Years of Removable Storage, an important way to transport data between computers. Courtesy of avaragado (CC BY 2.0)

In the first decade of the 2000s, increasing access to (wireless) internet and the introduction of smartphones resulted in an explosion of more centralized services, such as web mail, maps and photo albums. This made it easier for users to access their data and the same services from different computing devices, without the need to transfer data between devices using physical storage. An added advantage of this centralization is that you have a single source of truth. If you have two different versions of a file on two different devices, it can be difficult to know which one is the best one, but with a centralized approach there is only one authoritative copy.

The advantage of this kind of centralization is that it becomes possible to outsource the configuration and management of computers and software. While it is possible to set up your own private mail service, doing so takes a lot of effort, so most individuals outsource this to Google, Microsoft or their Internet Service Provider. The ability to outsource complicated IT management tasks to companies who have a lot more knowledge and experience is potentially very attractive to many (smaller) companies, as they can avoid having to set up their own IT department and focus on their core business competencies.

While access to the internet became more and more widespread, computers became so powerful that their resources were mostly idle during daily office work, and operating systems were improved to be able to run many different programs in parallel. This innovation was taken to the extreme by the development of virtualization software, which made it possible to let a big and powerful computer behave as if it were a number of smaller virtual computers, allocating computational resources to the virtual computer that requires them the most. This provided significant economy-of-scale benefits, in particular to companies that already had big data centers full of computers providing services to internet users.

Hardware Virtualization Overview
Diagram showing an example of hardware virtualization: one physical computer behaves as if it were three virtual computers.

Arguably, the advantages offered by centralization, the ease of use thanks to outsourcing, and the scalability offered by virtualization are all important drivers of the success of the cloud. In the last decade, a plethora of cloud services offered by all kinds of businesses has emerged. As of 2021, major cloud service providers such as Amazon Web Services and Microsoft Azure offer hundreds of different cloud services to their customers. To make sense of this wide range of services, they are often categorized by Something-as-a-Service labels. Full-featured, ready-to-use services like Google Docs, Office 365 or Overleaf are often denoted by the Software-as-a-Service label, abbreviated to SaaS.

In this workshop, we focus on the other end of the spectrum, often denoted as Infrastructure-as-a-Service (IaaS). An important part of this is the ability to rent a (virtualized) computer running in the cloud with little effort, paying only for the time the computer is running. It is like renting a server that you can use for whatever purpose you see fit, paying only for it while it is turned on. This particular type of service is typically called a Virtual Machine (VM) or Virtual Private Server (VPS), and is offered by most cloud providers. Many of the other cloud services are built on top of this type of service, and typically focus on a more specific task such as Data Science, Machine Learning, File Storage, Databases, Web Services, etcetera. Many of these types of services consist of a VPS with some additional software pre-configured to facilitate a particular task, e.g. a Data Science cloud service could consist of a VPS that comes with R and Python and the most popular packages pre-installed, saving you the hassle of having to install them yourself.

Some cloud providers

Below is a list of some well-known cloud providers and useful services offered by them.

There are many other smaller providers that provide VPS or IaaS kind of services. Furthermore, other types of cloud services that can be interesting to consider are:

  • Google Colab is a specialized cloud service that focuses on writing and running Jupyter notebooks for free, including access to GPUs.
  • RStudio Cloud is a specialized service for running RStudio and R on a cloud instance, which can be used from your browser.

Key Points

  • Explain concepts of centralization, outsourcing and virtualization

  • Discuss different cloud service models such as IaaS and SaaS

  • Name the important cloud service providers such as Amazon and Microsoft


Setting up a machine in the cloud

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How do I create and configure a virtual machine I can control?

  • How can I clean up my virtual machine when I’m done with it?

Objectives
  • Get an idea of how you create a new virtual machine, for example in Microsoft Azure

  • Know what settings and options are important to consider when creating a new virtual machine

  • Be aware how you clean up to avoid unexpected costs

You do not need to perform these steps during this workshop!

The steps in this lesson are only meant to show how you can start up your own virtual machine. In this workshop, your instructor will have done these steps already, and will just provide you with a username and password that you can work with. The reason we still discuss them is to give you an idea of what needs to be done to start up the computer.

As a first step, we will start a Virtual Machine in the cloud. In this example we will use Microsoft Azure, but if you use another provider (e.g. Amazon, Google) the procedure will be similar. A list of providers you can consider was provided in the first lesson, but these steps should be quite similar for almost all cloud providers and hosters who offer Virtual Private Servers (VPS).

Starting a new Virtual Machine

After you set up an account (and possibly configure how billing works), most cloud providers have some portal or dashboard. There you can typically find some option to create a new Virtual Machine (or whatever the service is called by your cloud provider of choice).

Creating a new VM from the Azure Portal
Creating a VM from the Azure portal

When you want to start a new instance, there are typically a number of things you need to choose and configure, such as the number of virtual CPUs (vCPUs), the amount of memory (RAM), the operating system and the amount of storage.

The number of vCPUs typically determines how many computations you can run in parallel. Most current-day laptops have 2 or 4 CPU cores, and 8GB to 16GB of RAM. However, if your program is unable to make use of multiple cores (as many programs are), a single vCPU will be sufficient unless you want to run the same program multiple times in parallel.

Furthermore, you typically can also assign a name to the Virtual Machine, and link it to some billing account. In the Azure portal, choosing which type of Virtual Machine you want can be done under the size setting:

Configure the properties of a new VM
Configuring the properties of a new VM in Azure

With Azure, you can choose between a Windows Server based or a Linux based virtual machine. In this workshop, we will go with Linux.

Which Operating System to Use

If you are familiar with Windows, and want to run a program that you use on your own Windows computer, you can consider using a Windows VM. Remote control of such virtual machines usually happens via Remote Desktop, which behaves very similarly to working on a local Windows computer. However, Windows virtual machines are more expensive due to licensing costs, not all cloud providers offer them, and Linux is in general more popular for servers. There are estimates that over 90% of the servers in the world and in cloud computing run Linux, and even on Microsoft’s own Azure cloud platform, Linux virtual machines are more popular than Windows virtual machines.

There are many different Linux-based operating systems, often called distributions. For server computing, some notable ones are Ubuntu, Debian, Red Hat, CentOS and SUSE. For beginners, Ubuntu and Debian are perhaps the easiest to start with. As Ubuntu offers somewhat more recent packages, whereas Debian favors stability, we will work with Ubuntu in this lesson. However, almost everything you learn will be applicable to any kind of Linux/UNIX system. The major differences turn up when you want to install software: Ubuntu and Debian use the apt package manager, whereas Red Hat and CentOS use the yum package manager. Package managers are like app stores: they make it really easy to install new software on a system.
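For example, installing a program on an Ubuntu or Debian system (here the process viewer htop, chosen purely as an illustration) would look something like this:

$ sudo apt update        # refresh the list of available packages
$ sudo apt install htop  # install the htop package

On Red Hat or CentOS, the equivalent would be sudo yum install htop.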

The next step is to set up a user. This will typically be an administrator account that we can use to connect to and manage the machine. We also want to enable SSH access to the machine (port 22 should be open, and not blocked by a firewall), so we can connect to the machine and login as this user.

Configuring an administrative user
Configuring an administrative user for the Virtual Machine

Security is important!

Please take care in securing your virtual machine, even if you only use it for simple computations. For hackers and criminals, controlling machines can help them do all kinds of malicious things, and if this happens on a virtual machine you have created, it can cause you trouble. Therefore, you should always use a strong, unique password for an account with administrator privileges, and keep that password safe. It is even better to use private key authentication, if possible.
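As a minimal sketch of how private key authentication works: you generate a key pair on your own computer, keep the private key secret, and paste the public key into the cloud portal when creating the virtual machine. With OpenSSH (available on Linux, Mac and recent versions of Windows), generating a key pair of the common ed25519 type looks like this:

$ ssh-keygen -t ed25519

This creates a private key (~/.ssh/id_ed25519) and a public key (~/.ssh/id_ed25519.pub); only the public key should ever leave your computer.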

Finally, your virtual machine needs storage space (the virtual equivalent of a hard drive). Unless you need to store vast amounts of data, you can typically stick with a default amount (on a Linux system, something like 20GB should be sufficient for many use cases). Typically, we can also add additional storage, but for this lesson we do not need to, so we stick with the default.

Configuring storage on the Virtual Machine
Configuring storage on the Virtual Machine

Once we have configured CPUs, memory, storage, the operating system and a standard user, we are ready to go. Review the other settings, but you will probably be fine with the defaults. At the end, you will probably have to confirm that you want to start up the Virtual Machine.

Confirm the creation of the Virtual Machine
Confirm the creation of the Virtual Machine

It may take a little bit of time before the machine is configured and has started, but at some point you should get a notification that your machine is ready to use.

Our Virtual Machine is ready to use!
Our Virtual Machine is ready to use!

Now that our Virtual Machine has started, we should connect to it. Typically, when we navigate to the Virtual Machine in the portal of the cloud provider, we should see some information needed to connect to it. In Azure, there is a nice landing page that contains multiple options, including a Connect button that will give you more information on how to connect, the option to Stop or Delete the virtual machine, as well as the IP address of the virtual machine.

Landing page of the virtual machine with management options.
Landing page of the virtual machine

The IP address is important information that will allow you to connect to the remote computer. In case you created a Windows-based virtual machine, you can enter this into a Remote Desktop (RDP) client to connect to the computer. In case of a Linux virtual machine, you need to enter this into a Secure Shell (SSH) client to connect to and control the computer.

In the next episode we use this information to connect to the virtual machine.

Cleaning Up

At some point you will be done with your virtual machine and not need it any more; you should then shut it down and delete it. Typically, you can do so via the management page of the virtual machine, for example the one you can see in the figure above. Note that there may be resources related to your machine that are not deleted automatically once you delete the machine. It is a good idea to check if all resources have been removed and deleted to avoid unexpected costs and billing!

Avoid unexpected billing costs!

Some cloud providers, including the large ones such as AWS, Azure and Google Cloud, bill you for different resources separately. Things that may be billed separately include:

  • Running virtual machines
  • Disks and storage space
  • Networking (traffic, IP addresses, hostnames)
  • Monitoring and Analytics services

Once you are done with your virtual machine, it is thus very wise to double-check that all (billable) resources are deleted from your account. If you are not going to use cloud services for a while, you can even consider removing all services in your account just to be sure.

Key Points

  • Understand that CPU, RAM, disk space, an operating system and an administrative user are needed when you create a new virtual machine

  • Understand that a hostname and/or IP address is needed to connect to a virtual machine, either using Secure Shell (SSH) or Remote Desktop (RDP)

  • Be aware that you should delete/stop all resources when you are finished with your virtual machine, to avoid unexpected costs.


Connecting to a remote system

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How do I open a terminal?

  • How do I connect to a remote computer?

Objectives
  • Connect to a remote computer

Opening a Terminal

Connecting to a Linux system is most often done through a protocol known as “SSH” (Secure Shell). On Mac and Linux, you typically use SSH through a terminal, an application that mimics the old-fashioned experience of sitting behind a real terminal, with the advantage that it is very easy to have multiple terminal windows open at the same time.

On Windows, the easiest option is to install and use PuTTY from https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html (you probably want the 64-bit x86 installer), which is purely an SSH client, intended to connect to remote computers, and which does not offer the option to work with the shell locally. Alternatively, you can use a Linux terminal from Windows using the Windows Subsystem for Linux or the Git Bash shell that comes with Git for Windows, in which case the process is very similar to the process for Linux users.

Mac

Macs have had a terminal built in since the first version of OS X, since it is built on a UNIX-like operating system, leveraging many parts from BSD (Berkeley Software Distribution). The terminal can be quickly opened through the use of the Spotlight tool: hold down the command key and press the spacebar. In the search bar that shows up, type “terminal”, choose the Terminal app from the list of results (it will look like a tiny, black computer screen) and you will be presented with a terminal window. Alternatively, you can find Terminal under “Utilities” in the Applications menu.

Linux

There are many different versions (aka “flavours”) of Linux and how to open a terminal window can change between flavours. Fortunately most Linux users already know how to open a terminal window since it is a common part of the workflow for Linux users. If this is something that you do not know how to do then a quick search on the Internet for “how to open a terminal window in” with your particular Linux flavour appended to the end should quickly give you the directions you need.

Logging onto the system

With all of this in mind, let’s connect to a remote system. In this workshop, we will connect to the cloud system we set up in the previous episode. For this, you need the system’s IP address or hostname, which you can get from the virtual machine’s landing page in the portal of your cloud provider. During the workshop, your instructor will provide you with this information. Additionally, you need a user name and password to be able to log in to the remote computer.

Entering a password

When you are entering a password in a terminal, no output is displayed. If you are used to password prompts that show dots or stars indicating you typed something, this can be a bit unsettling, but it is perfectly normal. Just press Enter when you are done, and you should get a message indicating whether your password was correct.

One reason it works this way is that less information is leaked: someone who peeks at your screen while you enter your password will not even know how long your password is!

Go to the section that fits your operating system:

Log in using Windows and Putty

When you start up PuTTY, the first thing it will ask you for is a host name or IP address. At this point, you should have received this information about the computer you want to connect to; fill it in.

The screen PuTTY opens with
The starting screen of PuTTY can be used to set up an ssh-connection to a remote computer

To get started, all you need to do is fill in the hostname or IP address in the designated field, and click Open. If it is the first time you connect to this server, it will show the following popup:

PuTTY prompt to trust the server's security certificate
PuTTY prompt to trust the server's security certificate.

Accept this certificate.

Server certificate

The first time you connect to a new computer, your SSH client checks the identity of the server based on its certificate. Typically, your client will store this certificate so the next time you connect, it does not have to ask you about the certificate any more. If the certificate of the server were to change, that would mean that you are either communicating with a different virtual machine than you were before, or that you reinstalled the operating system on the server, so it generated a new certificate.

If all went well, you should be logged in now. Continue to the success part of this episode.

Log in using the Mac OS terminal application

Once you have started the Terminal application, you can either type in the ssh command directly (if you prefer that, read the Linux section of this episode) or set up the connection with a GUI dialog.

To set up a remote connection with a GUI dialog, choose the New Remote Connection… option in the Shell menu of the Terminal application, as can be seen below:

Set up a remote connection with the Mac OS terminal application
Set up a remote connection with the Mac OS terminal application

You should then click Secure Shell (ssh) and click the + button to add a server. Add the IP address of the server you want to connect to, and fill in the user name you want to use to connect to the server. You should be able to see that a command is constructed that looks something like ssh username@the.server.address, as in the example below:

Setting up a remote connection with an address and user name in the Mac OS terminal application.
Setting up a remote connection with an address and user name in the Mac OS terminal application.

Finally, the first time you do this, you will see a prompt similar to the one below asking you to add a certificate for the server you connect to. Accept the certificate by typing yes and pressing Enter.

Prompt to accept a host fingerprint/certificate the first time  you connect to a new server
Prompt to accept a host fingerprint/certificate the first time you connect to a new server

Finally, you should enter your password. Remember that anything you enter will remain invisible (no dots or stars are shown). You can use a right-click to paste a password into the terminal. Once you have entered the password, press Enter to submit it.

If all went well, you should be logged in now. Continue to the success part of this episode.

Log in from a local bash terminal (Linux)

SSH allows us to connect to UNIX computers remotely, and use them as if they were our own. The general syntax of the connection command follows the format

ssh yourUsername@the.server.address

The first time you connect to a new computer, you may have to accept the security certificate of the server. Accept the certificate by typing yes and pressing Enter. Typically, your client will then store this certificate so the next time you connect, it does not have to ask you about the certificate any more. If the certificate of the server were to change, that would mean that you are either communicating with a different virtual machine than you were before, or that you reinstalled the operating system on the server, so it generated a new certificate.

When successfully logged in

If you’ve connected successfully, you should see a welcome message. On a basic Azure Ubuntu virtual machine, the message looks something like this:

PuTTY example of a successful login
Example of a succesful login with a typical welcome message from Ubuntu

It may contain some more private information (such as the last IP address that connected with your user account). Note that at the end it shows a prompt:

username@machine-name:~$

The $ indicates that the server is waiting for you to type a new command for it to execute. You’re connected and ready to go!

Transferring files to and from the remote computer

Now that you have connected to the remote computer, you may wonder how you can transfer files to and from it, including your own programs, data files, and results of experiments that ran on the virtual machine. This can typically be done using an SFTP client and the same credentials you use to log in with SSH (SFTP stands for SSH File Transfer Protocol). It can be very helpful to use a graphical client for this; some good free options are WinSCP for Windows users, and FileZilla for all platforms. If you struggle to connect, make sure that the SFTP protocol is selected.
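If you prefer the command line over a graphical client, the scp command (installed alongside ssh on most systems) transfers files over the same protocol. As a sketch, assuming the username and address from before, and a local file mydata.csv and a remote file results.txt (both file names are just examples):

$ scp mydata.csv yourUsername@the.server.address:~/
$ scp yourUsername@the.server.address:~/results.txt .

The first command copies the local file into your home directory on the server; the second copies a file from the server into your current directory.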

Key Points

  • To connect to a remote computer system using SSH and a password, use a tool with graphical configuration (PuTTY, Mac OS Terminal) or run ssh yourUsername@remote.computer.address in an existing bash command line.


Moving around and looking at things

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • How do I navigate and look around the system?

Objectives
  • Learn how to navigate around directories and look at their contents

  • Explain the difference between a file and a directory.

  • Translate an absolute path into a relative path and vice versa.

  • Identify the actual command, flags, and filenames in a command-line call.

  • Demonstrate the use of tab completion, and explain its advantages.

Important Tips

In this episode you start working with the bash shell. There are two very useful key combinations to know before you start.

  • You can go back to (and edit) the previous commands you typed with the up arrow key. If you go too far back in history, you can move forward again with the down arrow key. Using these keys well will avoid a lot of typing, in particular if you introduce typos, or want to make small adjustments to a command!
  • Commands and filenames can often be autocompleted by pressing Tab. If you have to type in the name of a file this_is_a_super_long_filename_i_would_hate_to_have_to_type_it_all, you can type a small part of it, e.g. this and press tab. Typically, bash will auto-complete the filename for you, if this is possible. If there is some ambiguity, it will only autocomplete up to the part where there is no ambiguity. Try pressing Tab when you can, and you’ll get the hang of it!
  • In many terminal applications, pasting is done by doing a right-click with your mouse, so the standard shortcut CTRL+V is often not necessary. Similarly, copying text to the clipboard is often done by just selecting a piece of text in the terminal, and the standard shortcut CTRL+C is often not necessary.

At this point in the lesson, we’ve just logged into the system. Nothing has happened yet, and we’re not going to be able to do anything until we learn a few basic commands. By the end of this lesson, you will know how to “move around” the system and look at what’s there.

Right now, all we see is something that looks like this (assuming test01 is the username and Cloud-Workshop-VM is the hostname of the machine):

test01@Cloud-Workshop-VM:~$

The dollar sign is a prompt, which shows us that the shell is waiting for input; your shell may use a different character as a prompt and may add information before the prompt. When typing commands, either from these lessons or from other sources, do not type the prompt, only the commands that follow it.

Type the command whoami, then press the Enter key (sometimes marked Return) to send the command to the shell. The command’s output is the ID of the current user, i.e., it shows us who the shell thinks we are:

$ whoami
yourUsername

More specifically, when we type whoami the shell:

  1. finds a program called whoami,
  2. runs that program,
  3. displays that program’s output, then
  4. displays a new prompt to tell us that it’s ready for more commands.

Next, let’s find out where we are by running a command called pwd (which stands for “print working directory”; “directory” is another word for “folder”). At any moment, our current working directory (where we are) is the directory that the computer assumes we want to run commands in unless we explicitly specify something else. Here, the computer’s response is /home/yourUsername, which is yourUsername’s home directory. Note that the location of your home directory may differ from system to system.

$ pwd
/home/yourUsername

So, we know where we are. How do we look and see what’s in our current directory?

$ ls

ls prints the names of the files and directories in the current directory in alphabetical order, arranged neatly into columns.

examples  welcome.txt

If nothing shows up when you run ls, it means that nothing’s there. Let’s make a directory for us to play with.

mkdir <new directory name> makes a new directory with that name in your current location. Notice that this command required two pieces of input: the actual name of the command (mkdir) and an argument that specifies the name of the directory you wish to create.

$ mkdir documents

Let’s use ls again. What do we see?

Our folder is there, awesome. What if we wanted to go inside it and do stuff there? We will use the cd (change directory) command to move around. Let’s cd into our new documents folder.

$ cd documents
$ pwd
~/documents

What is the ~ character? When using the shell, ~ is a shortcut that represents /home/yourUserName.
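You can check this yourself with the echo command, which simply prints whatever you give it; the shell expands ~ before echo ever sees it:

$ echo ~
/home/yourUserName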

Now that we know how to use cd, we can go anywhere. That’s a lot of responsibility. What happens if we get “lost” and want to get back to where we started?

To go back to your home directory, the following three commands will work:

$ cd /home/yourUserName
$ cd ~
$ cd

A quick note on the structure of a UNIX (Linux/Mac/Android/Solaris/etc) filesystem. Directories and absolute paths (i.e. exact position in the system) are always prefixed with a /. / by itself is the “root” or base directory.

Let’s go there now, look around, and then return to our home directory.

$ cd /
$ ls
$ cd ~
bin   etc   lib64       mnt   root  snap  tmp  vmlinuz
boot  home  lost+found  opt   run   srv   usr  vmlinuz.old
dev   lib   media       proc  sbin  sys   var

The “home” directory is the one where we generally want to keep all of our files. Other folders on a UNIX OS contain system files, and get modified and changed as you install new software or upgrade your OS.

Difference between Windows and UNIX

No Drive Letters

The folder structure on a UNIX system is a bit different from the Windows structure. There are no drive letters such as C: or D: for different drives; instead, all paths start with /. If you use Linux on a desktop computer, a connected USB stick would typically be accessed via a path such as /media/usb/ rather than via a drive letter.


Upper and Lower Case names are different

On UNIX-based systems, file names are case-sensitive, which means that the upper-case version of a letter is considered different from the lower-case version. That means that on a UNIX system, a folder can contain two separate files named myFile and MyFile. On Windows, those two filenames would be considered equal, and the files could not co-exist in the same folder.
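You can try this out with the touch command, which creates an empty file with the given name (touch is not otherwise used in this lesson):

$ touch myFile MyFile
$ ls
MyFile  myFile

On a UNIX system, both files are created and listed separately.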

There are several other useful shortcuts you should be aware of: . refers to the current directory, and .. refers to the directory one level above the current one (the “parent” directory).

Let’s try these out now:

$ cd ./documents
$ pwd
$ cd ..
$ pwd
/home/yourUserName/documents
/home/yourUserName

Many commands also have multiple behaviours that you can invoke with command line ‘flags’. What is a flag? It’s generally just your command followed by a ‘-’ and the name of the flag (sometimes it’s ‘--’ followed by the full name of the flag). You follow the flag(s) with any additional arguments you might need.

We’re going to demonstrate a couple of these “flags” using ls.

Show hidden files with -a. Hidden files are files that begin with .; these files will not appear otherwise, but that doesn’t mean they aren’t there! “Hidden” files are not hidden for security purposes; they are usually just configuration files and other temporary files that the user doesn’t necessarily need to see all the time.

$ ls -a
.  ..  .bash_logout  .bash_profile  .bashrc  documents  .emacs  .mozilla  .ssh

Notice how both . and .. are visible as hidden files. Show files with their sizes in bytes, dates last modified, permissions, and other things using -l.

$ ls -l
drwxr-xr-x 2 yourUsername tc001 4096 Jan 14 17:31 documents

This is a lot of information to take in at once, but we will explain this later! ls -l is extremely useful, and tells you almost everything you need to know about your files without actually looking at them.

We can also use multiple flags at the same time!

$ ls -l -a
$ ls -la
total 36
drwx--S--- 5 yourUsername tc001 4096 Nov 28 09:58 .
drwxr-x--- 3 root         tc001 4096 Nov 28 09:40 ..
-rw-r--r-- 1 yourUsername tc001   18 Dec  6  2016 .bash_logout
-rw-r--r-- 1 yourUsername tc001  193 Dec  6  2016 .bash_profile
-rw-r--r-- 1 yourUsername tc001  231 Dec  6  2016 .bashrc
drwxr-sr-x 2 yourUsername tc001 4096 Nov 28 09:58 documents
-rw-r--r-- 1 yourUsername tc001  334 Mar  3  2017 .emacs
drwxr-xr-x 4 yourUsername tc001 4096 Aug  2  2016 .mozilla
drwx--S--- 2 yourUsername tc001 4096 Nov 28 09:58 .ssh

Flags generally precede any arguments passed to a UNIX command. ls actually takes an extra argument that specifies a directory to look into. When you use flags and arguments together, the syntax (how it’s supposed to be typed) generally looks something like this:

$ command <flags/options> <arguments>

So using ls -l -a on a different directory than the one we’re in would look something like:

$ ls -l -a ~/documents
drwxr-sr-x 2 yourUsername tc001 4096 Nov 28 09:58 .
drwx--S--- 5 yourUsername tc001 4096 Nov 28 09:58 ..

Where to go for help?

How did I know about the -l and -a options? Is there a manual we can look at when we need help? There is a very helpful manual for most UNIX commands: man (if you’ve ever heard of a “man page” for something, this is what it is).

$ man ls
LS(1)                          User Commands                          LS(1)

NAME
     ls - list directory contents

SYNOPSIS
     ls [OPTION]... [FILE]...

DESCRIPTION
     List  information  about the FILEs (the current directory by default).
     Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.

     Mandatory arguments to long options are mandatory for short options too.

To navigate through the man pages, you may use the up and down arrow keys to move line-by-line, or try the spacebar and b keys to skip up and down by a full page. Quit the man pages by pressing q.

Alternatively, most commands you run will have a --help option that displays additional information. For instance, with ls:

$ ls --help
Usage: ls [OPTION]... [FILE]...
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.

Mandatory arguments to long options are mandatory for short options too.
  -a, --all                  do not ignore entries starting with .
  -A, --almost-all           do not list implied . and ..
      --author               with -l, print the author of each file
  -b, --escape               print C-style escapes for nongraphic characters
      --block-size=SIZE      scale sizes by SIZE before printing them; e.g.,
                               '--block-size=M' prints sizes in units of
                               1,048,576 bytes; see SIZE format below
  -B, --ignore-backups       do not list implied entries ending with ~

# further output omitted for clarity

Unsupported command-line options

If you try to use an option that is not supported, ls and other programs will print an error message similar to this:

[remote]$ ls -j
ls: invalid option -- 'j'
Try 'ls --help' for more information.

Looking at documentation

Looking at the man page for ls or using ls --help, what does the -h (--human-readable) option do?

Absolute vs Relative Paths

Starting from /Users/amanda/data/, which of the following commands could Amanda use to navigate to her home directory, which is /Users/amanda?

  1. cd .
  2. cd /
  3. cd /home/amanda
  4. cd ../..
  5. cd ~
  6. cd home
  7. cd ~/data/..
  8. cd
  9. cd ..

Solution

  1. No: . stands for the current directory.
  2. No: / stands for the root directory.
  3. No: Amanda’s home directory is /Users/amanda.
  4. No: this goes up two levels, i.e. ends in /Users.
  5. Yes: ~ stands for the user’s home directory, in this case /Users/amanda.
  6. No: this would navigate into a directory home in the current directory if it exists.
  7. Yes: unnecessarily complicated, but correct.
  8. Yes: shortcut to go back to the user’s home directory.
  9. Yes: goes up one level.

Relative Path Resolution

Using the filesystem diagram below, if pwd displays /Users/thing, what will ls -F ../backup display?

  1. ../backup: No such file or directory
  2. 2012-12-01 2013-01-08 2013-01-27
  3. 2012-12-01/ 2013-01-08/ 2013-01-27/
  4. original/ pnas_final/ pnas_sub/

File System for Challenge Questions

Solution

  1. No: there is a directory backup in /Users.
  2. No: this is the content of /Users/thing/backup, but with .. we asked for one level further up.
  3. No: see previous explanation.
  4. Yes: ../backup/ refers to /Users/backup/.

ls Reading Comprehension

Assuming a directory structure as in the above Figure (File System for Challenge Questions), if pwd displays /Users/backup, and -r tells ls to display things in reverse order, what command will display:

pnas_sub/ pnas_final/ original/
  1. ls pwd
  2. ls -r -F
  3. ls -r -F /Users/backup
  4. Either #2 or #3 above, but not #1.

Solution

  1. No: pwd is not the name of a directory.
  2. Yes: ls without directory argument lists files and directories in the current directory.
  3. Yes: uses the absolute path explicitly.
  4. Correct: see explanations above.

Exploring More ls Arguments

What does the command ls do when used with the -l and -h arguments?

Some of its output is about properties that we do not cover in this lesson (such as file permissions and ownership), but the rest should be useful nevertheless.

Solution

The -l argument makes ls use a long listing format, showing not only the file/directory names but also additional information such as the file size and the time of its last modification. The -h argument makes the file size “human readable”, i.e. display something like 5.3K instead of 5369.

Listing Recursively and By Time

The command ls -R lists the contents of directories recursively, i.e., lists their sub-directories, sub-sub-directories, and so on in alphabetical order at each level. The command ls -t lists things by time of last change, with most recently changed files or directories first. In what order does ls -R -t display things? Hint: ls -l uses a long listing format to view timestamps.

Solution

The directories are listed alphabetically at each level; the files/directories within each directory are sorted by time of last change.

Key Points

  • Your current directory is referred to as the working directory.

  • To change directories, use cd.

  • To view files, use ls.

  • You can view help for a command with man command or command --help.

  • Hit Tab to autocomplete whatever you’re currently typing.


Writing and reading files

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How do I create/edit text files?

  • How do I move/copy/delete files?

Objectives
  • Learn to use the nano text editor.

  • Understand how to move, create, and delete files.

Now that we know how to move around and look at things, let’s learn how to read, write, and handle files! We’ll start by moving back to our home directory and creating a scratch directory:

$ cd ~
$ mkdir cloud-test
$ cd cloud-test

Creating and Editing Text Files

When working on a command line, it is useful to be able to create or edit text files. Text is one of the simplest computer file formats, defined as a simple sequence of text lines. Python scripts (.py), R scripts (.R) and Java source files (.java) are all examples of textual file formats.

What if we want to make a file? There are a few ways of doing this, the easiest of which is simply using a text editor. For this lesson, we are going to use nano, since it’s more intuitive than many other terminal text editors.

To create or edit a file, type nano <filename> on the terminal, where <filename> is the name of the file. If the file does not already exist, it will be created. Let’s make a new file now, type whatever you want in it, and save it.

$ nano draft.txt
The nano command-line text-editor
The nano text editor

Nano defines a number of shortcut keys (prefixed by the Control or Ctrl key) to perform actions such as saving the file or exiting the editor. Here are the shortcut keys for a few common actions:

  • Ctrl+O: save the file (“write out”)
  • Ctrl+X: exit the editor (you will be asked whether to save unsaved changes)
  • Ctrl+K: cut the current line
  • Ctrl+U: paste the most recently cut text

Do a quick check to confirm our file was created.

$ ls
draft.txt

Reading Files

Let’s read the file we just created now. There are a few different ways of doing this, one of which is reading the entire file with cat.

$ cat draft.txt
It's not "publish or perish" any more,
it's "share and thrive".

By default, cat prints out the content of the given file. Although cat may not seem like an intuitive command with which to read files, it stands for “concatenate”. Giving it multiple file names will print out the contents of the input files in the order specified in the invocation. For example,

$ cat draft.txt draft.txt
It's not "publish or perish" any more,
it's "share and thrive".
It's not "publish or perish" any more,
it's "share and thrive".

Reading Multiple Text Files

Create two more files using nano, giving them different names such as chap1.txt and chap2.txt. Then use a single cat command to read and print the contents of draft.txt, chap1.txt, and chap2.txt.

Creating a Directory

We’ve successfully created a file. What about a directory? We’ve actually done this before, using mkdir.

$ mkdir files
$ ls
draft.txt  files

Moving, Renaming, Copying Files

Moving — We will move draft.txt to the files directory with the mv (“move”) command. The same syntax works for both files and directories: mv <file/directory> <new-location>

$ mv draft.txt files
$ cd files
$ ls
draft.txt

Renamingdraft.txt isn’t a very descriptive name. How do we go about changing it? It turns out that mv is also used to rename files and directories. Although this may not seem intuitive at first, think of it as moving a file to be stored under a different name. The syntax is quite similar to moving files: mv oldName newName.

$ mv draft.txt newname.testfile
$ ls
newname.testfile

File extensions are arbitrary

In the last example, we changed both a file’s name and extension at the same time. On UNIX systems, file extensions (like .txt) are arbitrary: a file is a .txt file only because we say it is. Changing the name or extension of a file will never change its contents, so you are free to rename things as you wish. With that in mind, however, file extensions are a useful tool for keeping track of what type of data a file contains: a .txt file typically contains text, for instance.

Copying — What if we want to copy a file, instead of simply renaming or moving it? Use the cp command (an abbreviated name for “copy”). This command has two different uses that work in the same way as mv: copying a file to a new name (cp <file> <new-filename>), and copying a file into another directory (cp <file> <directory>).

Let’s try this out.

$ cp newname.testfile copy.testfile
$ ls
$ cp newname.testfile ..
$ cd ..
$ ls
copy.testfile  newname.testfile
files  newname.testfile

Removing files

We’ve begun to clutter up our workspace with all of the directories and files we’ve been making. Let’s learn how to get rid of them. One important note before we start… when you delete a file on a UNIX system, it is gone forever. There is no “recycle bin” or “trash”: once a file is deleted, it is gone, never to return. So be very careful when deleting files.

Files are deleted with rm file [moreFiles]. To delete the newname.testfile in our current directory:

$ ls
$ rm newname.testfile
$ ls
files  newname.testfile
files

That was simple enough. Directories are deleted in a similar manner, using rmdir.

$ ls
$ rmdir files
$ ls
files
rmdir: failed to remove `files/': Directory not empty
files

What happened? As it turns out, rmdir is unable to remove directories that have stuff in them. To delete a directory and everything inside it, we will use a special variant of rm: rm -rf directory (the -r option stands for ‘recursive’ and -f for ‘force’). This is probably the scariest command on UNIX: it will force-delete a directory and all of its contents without prompting. ALWAYS double-check your typing before using it… if you point it at the wrong directory, it can delete everything on your file system that you have permission to delete. So when deleting directories, be very, very careful.

What happens when you use rm -rf accidentally

Steam is a major online sales platform for PC video games with over 125 million users. Despite this, it hasn’t always had the most stable or error-free code.

In January 2015, user kevyin on GitHub reported that Steam’s Linux client had deleted every file on his computer. It turned out that one of the Steam programmers had added the following line: rm -rf "$STEAMROOT/"*. Due to the way that Steam was set up, the variable $STEAMROOT was never initialized, meaning the statement evaluated to rm -rf /*. This coding error in the Linux client meant that Steam deleted every single file on a computer when run in certain scenarios (including connected external hard drives). Moral of the story: be very careful when using rm -rf!
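If you want an extra safety net, rm also has an -i flag that asks for confirmation before every single removal. A more cautious way to delete a directory and its contents would therefore be:

$ rm -r -i files

This prompts you for every file and directory before it is removed, so a slip of the fingers cannot wipe out more than you intended.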

Looking at files

Sometimes it’s not practical to read an entire file with cat: the file might be way too large, take a long time to open, or maybe we want to only look at a certain part of it. As an example, we are going to look at a large dataset from the National Institute for Public Health and the Environment (RIVM) that lists executed COVID-19 tests.

Let’s use the command line to download the latest version of this file directly to the server. To do this, we’ll use wget (wget <link> downloads a file from a link).

$ wget https://data.rivm.nl/covid-19/COVID-19_uitgevoerde_testen.csv

Let’s view the contents of the file by running cat on it.
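$ cat COVID-19_uitgevoerde_testen.csv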

Fortunately, this file is not that huge, but you can imagine what happens if you do this for a gigantic dataset. In such a case, you can stop cat by pressing Ctrl + C at the same time.

So, cat is a poor option when reading big files… it scrolls through the entire file far too quickly! What are the alternatives? Try all of these out and see which ones you like best!
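Three common alternatives are head (print the first lines of a file), tail (print the last lines) and less (scroll through a file page by page; press q to quit). For example, on the file we just downloaded:

$ head -n 5 COVID-19_uitgevoerde_testen.csv
$ tail -n 5 COVID-19_uitgevoerde_testen.csv
$ less COVID-19_uitgevoerde_testen.csv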

Out of cat, head, tail, and less, which method of reading files is your favourite? Why?

Key Points

  • Use nano to create or edit text files from a terminal.

  • Use cat file1 [file2 ...] to print the contents of one or more files to the terminal.

  • Use mv old dir to move a file or directory old to another directory dir.

  • Use mv old new to rename a file or directory old to a new name.

  • Use cp old new to copy a file under a new name or location.

  • Use cp old dir to copy a file old into a directory dir.

  • Use rm old to delete (remove) a file.

  • File extensions are entirely arbitrary on UNIX systems.


(Option 1 - Python) Run your own code

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How do I run Python programs?

  • What options do I have to pass data to my programs?

Objectives
  • Run Python programs from the command line

  • Understand how a program can read what you type into the terminal

  • Be able to pass arguments when you execute the program

  • Make a flexible Python program that can accept different arguments

Example code

In this episode you will work with example code. During the workshop your instructor will have set up a user account that comes with the example code already provided. If you are not participating in the workshop, you can download an archive with all the files assumed to be present during this episode.

Note: this episode is written for participants who prefer to work with Python. There are also versions of this episode for participants who prefer R or who prefer Java. If you are finished with this lesson, you can also go to the next episode.

In this episode we will run our own code, written in Python, from the command line. Let us suppose that we have written a program that searches for integer numbers in a certain range that have all the numbers in a given list as their divisors.

For this we have written the following program and put it in a file divisors.py. You can find this file in the directory ~/examples/python.

# These are the parameters
upper_bound = 100
divisors = [3, 5]

# Here we search for suitable numbers
result = []
for i in range(1,upper_bound):
   divisable = True
   for div in divisors:
      if i % div != 0:
         divisable = False
   if divisable:
      result.append(i)

print("The following numbers are divisible by all of the divisors")
print(result)

Portable Code

Since Python scripts are interpreted, you can easily write them on a Windows or Mac computer, and then run them on a Linux computer without modification. Making sure code runs on multiple operating systems is called writing portable code. For most Python code, things will work without adaptation, but you may have to pay attention to things that can differ between systems, such as:

  • Don’t hard-code absolute file paths that only exist on your local computer, such as C:\Users\Jane McDoe\mydata.csv, but use relative paths instead, e.g. mydata.csv.
  • Avoid hard-coding file separator symbols, i.e. / and \, as they differ between Windows and Mac/Linux. Use the property os.path.sep or the function os.path.join if you need this, as it will take a value depending on the operating system your program is run on (see the example after this list).
  • Avoid calling system specific commands, such as os.system('ls'), as it will not work on a Windows system.
  • If your program uses compiled or native libraries, be sure that the native libraries are available on all operating systems you want to run your program on.
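As a minimal sketch of building a path portably (the directory and file names are just examples):

import os

# os.path.join inserts the correct separator for the current system:
# '/' on Linux and Mac, '\' on Windows.
data_file = os.path.join("data", "mydata.csv")
print(data_file)    # prints data/mydata.csv on Linux
print(os.path.sep)  # prints the separator symbol itself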

We will now consider how we can run this program from the command line.

Running a Python program

Once we have a Python program in a file like this, it is actually not that difficult to run it from the command line. We can start the Python interpreter using the command python3.

$ python3

which puts us into an interactive mode. Here, we can evaluate Python code by typing it directly into the interpreter. In fact, it works very similarly to the bash shell, except that it speaks a different language.

Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> x = [1,2,3]
>>> x.append("hello")
>>> print(x)
[1, 2, 3, 'hello']
>>> exit()

While we could type the Python program line by line into this interactive mode, this is not very convenient, so let’s quit the interactive mode for now by typing exit().

If we type a filename that contains a Python script after the python3 command, it will execute that script directly rather than start the interactive mode. Let’s try this for our program.

$ cd ~/examples/python
$ python3 divisors.py

If all goes well, we should see the output of our program!

The following numbers are divisible by all of the divisors
[15, 30, 45, 60, 75, 90]

However, it might be nice if we could make the program a bit more flexible. We will look at two ways to do so: using standard input, and via command line arguments.

Reading data typed into the terminal

When you learned programming, you may have written interactive programs that ask you for your name and then print it back to you. This works fine from the command line: the input() call in Python can be used to read input typed by the user. Let us adjust the code a bit to let the user enter the bound and the divisors.

upper_bound = input("What is the upper bound of numbers to be considered?\n")
upper_bound = int(upper_bound)

divisors = []
read_more = True
while read_more:
   next = input("Enter a divisor you want to consider (or -1 if you are done)\n")
   next = int(next)
   if next == -1:
      read_more = False
   else:
      divisors.append(next)

# The code with the computation goes here

This version of the program is stored in the file divisors-stdin.py. Let’s run it:

$ python3 divisors-stdin.py

When we run this, we get interactive prompts where we can specify the bound and the divisors, as follows:

What is the upper bound of numbers to be considered?
100
Enter a divisor you want to consider (or -1 if you are done)
3
Enter a divisor you want to consider (or -1 if you are done)
5
Enter a divisor you want to consider (or -1 if you are done)
-1
The following numbers are divisible by all of the divisors
[15, 30, 45, 60, 75, 90]

Bash actually has a handy feature we can use if we want to avoid manually typing in all the input ourselves. Rather than sending data from the keyboard to the standard input of the Python program, we can send the data from a text file instead. We can do so with the < operator on the command line. There is already a text file we can try this with: ~/examples/data1.txt, which has the following contents:

100
3
5
-1

Let us send this as input to our Python program using the following command:

$ python3 divisors-stdin.py < ~/examples/data1.txt

This gives us the following output:

What is the upper bound of numbers to be considered?
Enter a divisor you want to consider (or -1 if you are done)
Enter a divisor you want to consider (or -1 if you are done)
Enter a divisor you want to consider (or -1 if you are done)
The following numbers are divisible by all of the divisors
[15, 30, 45, 60, 75, 90]

Input redirection hides the input

When we redirect input from a file to our Python program, the data from the file is not shown in the printed output. If we type the data into the terminal ourselves, the text is shown only because we typed it. This is why we do not see the numbers from ~/examples/data1.txt in our output.

In some cases, this can be more convenient than typing the input directly into the program. See what happens if you run the program with a different data file, ~/examples/data2.txt. Try editing the file and see if you can get a different output!
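As an aside: you can even generate the input on the fly, without preparing a file at all, by piping it into the program. Here printf produces the four input lines (\n is a newline) and the | (“pipe”) operator sends them to the standard input of our script; pipes are a bash feature we do not cover further in this lesson:

$ printf '100\n3\n5\n-1\n' | python3 divisors-stdin.py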

Reading command line arguments from our program

While reading input from a file passed via the command line does help to make our program more flexible, it can also be inconvenient that we must prepare a file with the things to send to the program.

When we worked with other terminal commands, such as ls, it was possible to pass specific arguments such as -l that adjusted the behavior of the program. We can do something similar with our own program.

If we import the Python module sys, the property sys.argv contains a list with the arguments passed to our program. The first element, sys.argv[0], is the name of the Python script and is often not that interesting, but the other arguments, i.e. sys.argv[1:], are! To test this, the following two-line Python program can be found under ~/examples/python/printargs.py:

import sys
print(sys.argv[1:])

If we run python3 printargs.py 100 3 5 we get the following output:

['100', '3', '5']

That seems to work! Let’s rewrite our program so that it works with these command line arguments rather than the standard input. This would give the following code:

import sys

# Converts the first command line argument to an int
upper_bound = int(sys.argv[1])
# Takes the remaining command line arguments sys.argv[2:] and converts them to ints using a list comprehension
divisors = [int(arg) for arg in sys.argv[2:]]

# The code with the computation goes here

This Python script is available as divisors-args.py. Let us run it using

$ python3 divisors-args.py

Unfortunately, this gives an error:

Traceback (most recent call last):
  File "divisors-args.py", line 4, in <module>
    upper_bound = int(sys.argv[1])
IndexError: list index out of range

The reason for this error is that Python expects there to be some arguments in sys.argv, but we forgot to pass them!

A list of strings and arguments with spaces

The arguments passed to our Python script are accessible within our Python program as sys.argv. Note that all arguments are strings, so we first need to convert them to other data types if that is desirable (in the example we convert them to ints). The arguments are separated by spaces: if we run python3 printargs.py hello there, we can see that sys.argv[1:] is the list ['hello', 'there']. If you want to pass an argument that contains a space, you can surround it with quotation marks, i.e. python3 printargs.py "hello there", which gives the list ['hello there'].

It can be handy to use string arguments in your programs. Such strings can be names of files from which your program should read data or to which it should write data, a URL to retrieve data from, or some non-numeric property.

Let’s fix this problem by adding some command line arguments

$ python3 divisors-args.py 100 3 5

Fortunately, this gives the correct output:

The following numbers are divisible by all of the divisors
[15, 30, 45, 60, 75, 90]

Feel free to play around with this! What happens if you pass different numbers?

Nicer command line arguments with a parser

Now that we have learned how to use arguments passed on the command line within our Python program, there are a few things to consider. First, the error message we got when we forgot to pass arguments was not really helpful. Second, if we have many arguments, it can become a hassle to remember the fixed order in which we should provide them. Furthermore, we may not want to be forced to provide all arguments at the same time. For the sake of usability, it is often a good idea to have sensible defaults for the settings of your program, and let the user only write command line arguments for settings they desire to override. Thus, if we compare our Python program to the comprehensive help we get when we run ls --help, it is clear that there is still room for improvement.

Python comes with a very helpful library that allows us to improve this rather easily: the argparse library. This library gives us the option to easily construct an argument parser, which is a program that can process command line arguments for us, and provide help and meaningful error messages in case of trouble. After the argument parser is constructed, we let it consume the command line arguments, and if the parser completes without raising an exception, it gives us easy access to the arguments we are interested in.

Setting up the argparse library is not that difficult, and is done in the script divisors-argparse.py. We use the following code to set up the argument parser:

import argparse

# Construct the command line parser
parser = argparse.ArgumentParser(description="Find numbers that are divisible by a list of divisors")
parser.add_argument('--bound', '-b', type=int, default=100, help="An upper bound on which numbers to check")
parser.add_argument('--divs', '-d', type=int, nargs='*', required=True, help="A list of divisors to check for")

# Try to parse the arguments using the parser
args = parser.parse_args()

# Extract the parsed upper bound
upper_bound = args.bound
# Extract the parsed list of divisors
divisors = args.divs

As you can see, setting up the parser requires only three lines of code: one for the parser itself and one line for each argument. For each argument we specify a number of properties: the names of the argument, the type of the argument, and a help message to display. Furthermore, we can define the default value of the argument if the user omits it, and we can also specify arguments that can have any number of values, such as the --divs argument in the example does with the nargs='*' attribute. This tells the argument parser that multiple values can be provided after this argument, and the parser will then give a list of values rather than a single value.
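
The parser also validates the values we pass. If an argument cannot be converted to the requested type, argparse reports this and stops the program. For example, passing a non-numeric bound gives an error along the following lines (a sketch; the exact wording may vary between Python versions):

$ python3 divisors-argparse.py --divs 3 5 --bound many

usage: divisors-argparse.py [-h] [--bound BOUND] --divs [DIVS [DIVS ...]]
divisors-argparse.py: error: argument --bound/-b: invalid int value: 'many'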

First, let’s try our program with the --help argument:

$ python3 divisors-argparse.py --help

which prints

usage: divisors-argparse.py [-h] [--bound BOUND] --divs [DIVS [DIVS ...]]

Find numbers that are divisible by a list of divisors

optional arguments:
  -h, --help            show this help message and exit
  --bound BOUND, -b BOUND
                        An upper bound on which numbers to check
  --divs [DIVS [DIVS ...]], -d [DIVS [DIVS ...]]
                        A list of divisors to check for

If we call the script without any arguments, i.e. python3 divisors-argparse.py, we get a much more comprehensible error message as well.

usage: divisors-argparse.py [-h] [--bound BOUND] --divs [DIVS [DIVS ...]]
divisors-argparse.py: error: the following arguments are required: --divs/-d

This can easily be fixed by adding the proper argument. When we run

$ python3 divisors-argparse.py --divs 3 5

the output looks like we have come to expect. Notice that we did not provide the --bound argument, and that the default value 100 is still being used! If we want to know the results for a different bound, we can add it (either before or after the --divs argument).

$ python3 divisors-argparse.py --divs 3 5 --bound 250

which finally prints

The following numbers are divisible by all of the divisors
[15, 30, 45, 60, 75, 90, 105, 120, 135, 150, 165, 180, 195, 210, 225, 240]

Modify the program

Add a third argument --forbid (and short name -f) that accepts a list of divisors that should not be a divisor of a number. This argument should be optional. Copy divisors-argparse.py to a new file divisors-argparse2.py and adjust the code to include this new option correctly. You can consider making the --divs argument optional as well, but this is not required. Make sure that running

$ python3 divisors-argparse2.py -d 3 5

produces the output

The following numbers adhere to the rules defined
[15, 30, 45, 60, 75, 90]

and that

$ python3 divisors-argparse2.py -d 3 5 -f 20

produces the output

The following numbers adhere to the rules defined
[15, 30, 45, 75, 90]

Solution

First, it is necessary to add the new option to the parser, for example:
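
# A possible way to add the new option (a sketch; the exact help text is an assumption)
parser.add_argument('--forbid', '-f', type=int, nargs='*', default=[], help="A list of divisors a number may not have")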

Then, add a line that extracts the argument into a variable with a name such as forbidden:
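
# Extract the parsed list of forbidden divisors
forbidden = args.forbid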

Finally, add an extra loop that sets divisable to False if i is divisible by a number in forbidden. The complete solution could look as follows:
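
# divisors-argparse2.py - a possible complete solution (a sketch; the file provided
# in the workshop may differ in details)
import argparse

# Construct the command line parser with the extra --forbid option
parser = argparse.ArgumentParser(description="Find numbers that are divisible by a list of divisors")
parser.add_argument('--bound', '-b', type=int, default=100, help="An upper bound on which numbers to check")
parser.add_argument('--divs', '-d', type=int, nargs='*', required=True, help="A list of divisors to check for")
parser.add_argument('--forbid', '-f', type=int, nargs='*', default=[], help="A list of divisors a number may not have")

# Try to parse the arguments using the parser
args = parser.parse_args()

upper_bound = args.bound
divisors = args.divs
forbidden = args.forbid

result = []
for i in range(1, upper_bound + 1):
    divisable = True
    for div in divisors:
        if i % div != 0:
            divisable = False
    # Extra loop: reject numbers that are divisible by a forbidden divisor
    for div in forbidden:
        if i % div == 0:
            divisable = False
    if divisable:
        result.append(i)

print("The following numbers adhere to the rules defined")
print(result)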

Key Points

  • Your own Python programs can be run from the command line

  • There are different options to pass data into your program


(Option 2 - R) Run your own code

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How do I run R programs?

  • What options do I have to pass data to my programs?

Objectives
  • Run R programs from the command line

  • Understand how a program can read what you type into the terminal

  • Be able to pass arguments when you execute the program

  • Make a flexible R program that can accept different arguments

Example code

In this episode you will work with example code. During the workshop your instructor will have set up a user account that comes with the example code already provided. If you are not participating in the workshop, you can download an archive with all the files assumed to be present during this episode.

Note: this episode is written for participants who prefer to work with R. There are also versions of this episode for participants who prefer Python or who prefer Java. If you are finished with this lesson, you can also go to the next episode.

In this episode we will run our own code, written in R, from the command line. Let us suppose that we have written a program that searches for integer numbers in a certain range that have all the numbers in a list as a divisor.

For this we have written the following program and put it in a file divisors.R. You can find this file in the directory ~/examples/R.

upper_bound <- 100
divisors <- c(3, 5)

result <- c()
for (i in 1:upper_bound) {
   divisable <- TRUE
   for (div in divisors) {
      if (i %% div != 0) {
         divisable <- FALSE
      }
   }
   if (divisable) {
      result <- append(result, c(i))
   }
}

# Print the result (an assumption: the file provided in the workshop prints
# the result in the format shown later in this episode)
cat("The following numbers are divisible by all of the divisors\n")
cat(paste(result, collapse=", "), "\n", sep="")

Portable Code

Since R scripts are interpreted, you can easily write them on a Windows or Mac computer, and then run them on a Linux computer without modification. Making sure code runs on multiple operating systems is called writing portable code. For most R code, things will work without adaptation, but you may have to pay attention to things that can differ between systems, such as:

  • Don’t hard code absolute file paths that only exist on your local computer, such as C:\Users\Jane McDoe\mydata.csv, but always use relative paths, e.g. mydata.csv.
  • Avoid hard coding file separator symbols, i.e. / and \, as they differ between Windows and Mac/Linux. Use the property .Platform$file.sep if you need this, as it will take a value depending on the operating system your program is run on, or look for a more stable way to construct file paths in the documentation.
  • Avoid calling system specific commands, such as system('ls'), as it will not work on a Windows system.
  • If your program uses compiled or native libraries, be sure that the native libraries are available on all operating systems you want to run your program on.

We will now consider how we can run this program from the command line.

Running an R program

Once we have an R program in a file like this, it is actually not that difficult to run it from the command line. We can start the R interpreter using the command R.

$ R

which puts us into an interactive mode. Here, we can evaluate R code by directly typing it into the interpreter. In fact, it works very similarly to the bash shell, except that it speaks a different language.

R version 4.0.5 (2021-03-31) -- "Shake and Throw"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> x <- c(1,2,3)
> append(x, 'hello')
[1] "1"     "2"     "3"     "hello"
> q()
Save workspace image? [y/n/c]: n

While we could type the R program line by line into this interactive mode, this is not very convenient, so let’s quit the interactive mode for now by typing q() and answering n when asked whether to save the workspace image.

The R software distribution comes with a second program we can use to execute R scripts, called Rscript. We have to provide the name of a file that contains an R script after the Rscript command. It will then just execute the script and not run in interactive mode. Let’s try and do this for our program.

$ cd ~/examples/R
$ Rscript divisors.R

If all goes well, we should see the output of our program!

The following numbers are divisible by all of the divisors
15, 30, 45, 60, 75, 90

However, it might be nice if we can make the program a bit more flexible. We will look at two ways to do so: using standard in, and via command line arguments.

Reading data typed into the terminal

When you learned programming, you may have written interactive programs where the program would ask you for your name, and then printed it back to you. This works fine from the command line: the scan() call in R can be used to read input typed by the user. If we want to let the user enter a single numeric value from the command line, we can use the following:

stdin <- file("stdin", "r")
scan(file=stdin, what=integer(0), n=1, quiet=TRUE)

Here we first open a connection to the standard input. Then file=stdin means we are reading from that connection, what=integer(0) means we are reading an integer number, n=1 means we are reading a single value and quiet=TRUE suppresses output generated by the scan function when it is done reading. Now let’s adjust the code a bit to let the user enter the bound and the divisors.

stdin <- file("stdin", "r")
read_more <- TRUE
cat("What is the upper bound of numbers to be considered?\n")
upper_bound <- scan(file=stdin, what=integer(0), n=1, quiet=TRUE)
divisors <- c()
while (read_more) {
   cat("Enter a divisor you want to consider (or -1 if you are done)\n")
   div <- scan(file=stdin, what=integer(0), n=1, quiet=TRUE)
   if (div == -1) {
      read_more <- FALSE
   }
   else {
      # The append function can be used to add an element to an existing vector
      divisors <- append(divisors, div)
   }
}

# The code with the computation goes here

This version of the program is stored in the file divisors-stdin.R. Let’s run it

$ Rscript divisors-stdin.R

When we run this, we get interactive prompts where we can specify the bound and the divisors, as follows:

What is the upper bound of numbers to be considered?
100
Enter a divisor you want to consider (or -1 if you are done)
3
Enter a divisor you want to consider (or -1 if you are done)
5
Enter a divisor you want to consider (or -1 if you are done)
-1
The following numbers are divisible by all of the divisors
15, 30, 45, 60, 75, 90

Bash actually has a handy feature we can use if we want to avoid manually typing in all the input ourselves. Rather than sending data from the keyboard to the standard input of the R program, we can send the data from a text file instead. We can do so with the < operator on the command line. There is already a text file we can try this with: ~/examples/data1.txt, which has the following contents:

100
3
5
-1

Let us send this as input to our R program using the following command:

$ Rscript divisors-stdin.R < ~/examples/data1.txt

This gives us the following output:

What is the upper bound of numbers to be considered?
Enter a divisor you want to consider (or -1 if you are done)
Enter a divisor you want to consider (or -1 if you are done)
Enter a divisor you want to consider (or -1 if you are done)
The following numbers are divisible by all of the divisors
15, 30, 45, 60, 75, 90

Input redirection hides the input

When we redirect input from a file to our R program, the data from the file is not shown in the printed output. If we type the data into the terminal ourselves, the text appears on the screen only because we typed it. This is why we do not see the numbers from ~/examples/data1.txt in our output.

In some cases, this can be more convenient than typing the input directly into the program. See what happens if you run the program with a different data file, ~/examples/data2.txt. Try editing the file and see if you can get a different output!

Reading command line arguments from our program

While reading input from a file passed via the command line does help to make our program more flexible, it can also be inconvenient that we must prepare a file with the things to send to the program.

When we worked with other terminal commands, such as ls, we saw that it was possible to pass specific arguments, such as -l, that adjusted the behavior of the program. We can do something similar with our own program.

We can use the R function commandArgs() to obtain a vector with the arguments passed to our program. Since a plain call also returns a lot of additional information, such as the name of the program, it is useful to call it as commandArgs(trailingOnly=TRUE) to make sure we only get the arguments given on the command line after the name of the R script.

To test this, we provided a two line R script in ~/examples/R/printargs.R:

args <- commandArgs(trailingOnly=TRUE)
print(args)

If we run Rscript printargs.R 100 3 5 we get the following output:

[1] "100" "3"   "5"

That seems to work! Let’s rewrite our program so that it works with these command line arguments rather than the standard input. This would give the following code:

# Read the command line arguments and convert them to numbers
args <- commandArgs(trailingOnly=TRUE)
args <- as.numeric(args)

# Extract the first element as the upper bound
upper_bound <- args[1]
# Extract the remaining elements as the divisors
divisors <- args[2:length(args)]

# The code with the computation goes here

This R script is available as divisors-args.R. Let us run it using

$ Rscript divisors-args.R

Unfortunately, this gives an error:

Error in 1:upper_bound : NA/NaN argument
Execution halted

The reason for this error is that the script expects some arguments to be passed, but we forgot to pass them! As a result, upper_bound ends up as NA, and R cannot construct the range 1:upper_bound.

A list of strings and arguments with spaces

The arguments passed to our R script will be returned within our R program by commandArgs(trailingOnly=TRUE). Note that all arguments are strings, so we first need to convert them to other data types if that is desirable (in the example we convert them to numbers using the as.numeric() function). The arguments we pass are separated by spaces. Thus, if we run Rscript printargs.R hello there, we can see the commandArgs function gives us the vector "hello" "there". If you want to pass an argument that contains a space, you can surround it with quotation marks, i.e. Rscript printargs.R "hello there", which gives the vector "hello there".

It can be handy to use strings in your programs. Such strings can be the names of files your program should read data from or write data to, a URL to retrieve data from, or some other non-numeric property.

Let’s fix this problem by adding some command line arguments:

$ Rscript divisors-args.R 100 3 5

Fortunately, this gives the correct output:

The following numbers are divisible by all of the divisors
15, 30, 45, 60, 75, 90

Feel free to play around with this! What happens if you pass different numbers?

Nicer command line arguments with a parser

Now that we have learned how we can use arguments passed on the command line within our R program, there are a few things to consider. First, the error message we got when we forgot to pass arguments was not really helpful. Second, if we have many arguments, it can become a hassle to remember the fixed order in which we should provide them. Furthermore, we may not want to be forced to provide all arguments at the same time. For the sake of usability, it is often a good idea to have sensible defaults for the settings of your program, and to let the user only write command line arguments for the settings they desire to override. If we compare our R program to the comprehensive help we get when we run ls --help, it is clear that there is still room for improvement.

There exists a very helpful R package that allows us to improve this rather easily: the argparser package. This package gives us the option to easily construct an argument parser, which is a program that can process command line arguments for us, and provide help and meaningful error messages in case of trouble. After the argument parser is constructed, we let it consume the command line arguments, and if the parser completes without raising an exception, it gives us easy access to the arguments we are interested in.

Setting up the argparser package is not that difficult, and is done in the script divisors-argparser.R. We use the following code to set up the argument parser:


# Load the argparser library and install it if it is missing
if (!require("argparser", quietly=TRUE)) {
   install.packages("argparser")
   library(argparser)
}

# First we build up the argument parser
parser <- arg_parser("Find numbers that are divisible by a list of divisors")
parser <- add_argument(parser, "--bound", short="-b", help="An upper bound on which numbers to check", default=100)
parser <- add_argument(parser, "--divs", short="-d", help="A list of divisors to check for", nargs=Inf)


# Read the command line arguments and parse the arguments
args <- commandArgs(trailingOnly=TRUE)
parsed_args <- parse_args(parser, args)

# Extract the parsed data
upper_bound <- as.numeric(parsed_args$bound)
divisors <- as.numeric(parsed_args$divs)

# Check if some divisors are given
if (any(is.na(divisors))) {
   cat("Please provide one or more divisors\n")
   quit()
}

As you can see, setting up the parser requires only three lines of code: one for the parser itself and one line for each argument. For each argument we specify a number of properties: the name of the argument, its short form, and a help message to display. Furthermore, we can define the default value of the argument if the user omits it, and we can also specify arguments that can have any number of values, such as the --divs argument in the example does with the nargs=Inf attribute. This tells the argument parser that multiple values can be provided after this argument, and the parser will then give a list of values rather than a single value.
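
Because of the check we added ourselves at the end of the snippet, running the script without any arguments should print a friendly message rather than a cryptic error (assuming the script matches the snippet above):

$ Rscript divisors-argparser.R

Please provide one or more divisors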

First, let’s try our program with the --help argument:

$ Rscript divisors-argparser.R --help

which prints

usage: divisors-argparser.R [--] [--help] [--opts OPTS] [--bound BOUND]
       [--divs DIVS]

Find numbers that are divisible by a list of divisors

flags:
  -h, --help   show this help message and exit

optional arguments:
  -x, --opts   RDS file containing argument values
  -b, --bound  An upper bound on which numbers to check [default: 100]
  -d, --divs   A list of divisors to check for

Note that the -x argument is added automatically, and can be safely ignored. Let’s try to run the script using these arguments. When we run

$ Rscript divisors-argparser.R --divs 3 5

the output looks like we have come to expect. Notice that we did not provide the --bound argument, and that the default value 100 is still being used! If we want to know the results for a different bound, we can add it (either before or after the --divs argument).

$ Rscript divisors-argparser.R --divs 3 5 --bound 250

which finally prints

The following numbers are divisible by all of the divisors
15, 30, 45, 60, 75, 90, 105, 120, 135, 150, 165, 180, 195, 210, 225, 240

Modify the program

Add a third argument --forbid (and short name -f) that accepts a list of divisors that should not be a divisor of a number. This argument should be optional. Copy divisors-argparser.R to a new file divisors-argparser2.R and adjust the code to include this new option correctly. You can consider making the --divs argument optional as well, but this is not required. Make sure that running

$ Rscript divisors-argparser2.R -d 3 5

produces the output

The following numbers adhere to the rules defined
15, 30, 45, 60, 75, 90

and that

$ Rscript divisors-argparser2.R -d 3 5 -f 20

produces the output

The following numbers adhere to the rules defined
15, 30, 45, 75, 90

Hint: Be aware that you may want to add an if statement that checks if the argument turns out to be NA. For example, you can use something like if (!any(is.na(forbidden))) { ... to guard against the case where no --forbid argument is passed.

Solution

First, it is necessary to add the new option to the parser, for example:
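
# A possible way to add the new option (a sketch; the exact help text is an assumption)
parser <- add_argument(parser, "--forbid", short="-f", help="A list of divisors a number may not have", nargs=Inf)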

Then, add a line that extracts the argument into a variable with a name such as forbidden:
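
# Extract the parsed forbidden divisors (this is NA if the --forbid argument was not passed)
forbidden <- as.numeric(parsed_args$forbid)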

Finally, add an extra loop that sets divisable to FALSE if i is divisible by a number in forbidden. The complete solution could look as follows:
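
# divisors-argparse2.R - a possible complete solution (a sketch; the file provided
# in the workshop may differ in details)
if (!require("argparser", quietly=TRUE)) {
   install.packages("argparser")
   library(argparser)
}

# Build up the argument parser with the extra --forbid option
parser <- arg_parser("Find numbers that are divisible by a list of divisors")
parser <- add_argument(parser, "--bound", short="-b", help="An upper bound on which numbers to check", default=100)
parser <- add_argument(parser, "--divs", short="-d", help="A list of divisors to check for", nargs=Inf)
parser <- add_argument(parser, "--forbid", short="-f", help="A list of divisors a number may not have", nargs=Inf)

# Read the command line arguments and parse them
args <- commandArgs(trailingOnly=TRUE)
parsed_args <- parse_args(parser, args)

# Extract the parsed data
upper_bound <- as.numeric(parsed_args$bound)
divisors <- as.numeric(parsed_args$divs)
forbidden <- as.numeric(parsed_args$forbid)

# Check if some divisors are given
if (any(is.na(divisors))) {
   cat("Please provide one or more divisors\n")
   quit()
}

result <- c()
for (i in 1:upper_bound) {
   divisable <- TRUE
   for (div in divisors) {
      if (i %% div != 0) {
         divisable <- FALSE
      }
   }
   # Only check the forbidden divisors if the --forbid argument was passed
   if (!any(is.na(forbidden))) {
      for (div in forbidden) {
         if (i %% div == 0) {
            divisable <- FALSE
         }
      }
   }
   if (divisable) {
      result <- append(result, c(i))
   }
}

cat("The following numbers adhere to the rules defined\n")
cat(paste(result, collapse=", "), "\n", sep="")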

Key Points

  • Your own R programs can be run from the command line

  • There are different options to pass data into your program


(Option 3 - Java) Run your own code

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How do I run Java programs?

  • What options do I have to pass data to my programs?

Objectives
  • Run Java programs from the command line

  • Understand how a program can read what you type into the terminal

  • Be able to pass arguments when you execute the program

  • Make a flexible Java program that can accept different arguments

Example code

In this episode you will work with example code. During the workshop your instructor will have set up a user account that comes with the example code already provided. If you are not participating in the workshop, you can download an archive with all the files assumed to be present during this episode.

Note: this episode is written for participants who prefer to work with Java. There are also versions of this episode for participants who prefer Python or who prefer R. If you are finished with this lesson, you can also go to the next episode.

In this episode we will run our own code, written in Java, from the command line. Let us suppose that we have written a program that searches for integer numbers in a certain range that have all the numbers in a list as a divisor.

For this we have written the following program and put it in a file Divisors.java. You can find this file in the directory ~/examples/java.

import java.util.List;
import java.util.ArrayList;

public class Divisors {
   public static void main(String [] args) {
      int upperBound = 100;
      List<Integer> divisors = List.of(3, 5);

      List<Integer> result = getDivisors(upperBound, divisors);
      System.out.println("The following numbers are divisible by all of the divisors");
      System.out.println(result);
   }

   public static List<Integer> getDivisors(int upperBound, List<Integer> divisors) {
      List<Integer> result = new ArrayList<>();
      for (int i=1; i < upperBound; i++) {
         boolean divisable = true;
         for (int div : divisors) {
            if (i%div != 0) {
               divisable = false;
            }
         }
         if (divisable) {
            result.add(i);
         }
      }
      return result;
   }
}

Portable Code

Since Java compiles to bytecode that is run on the JVM, you can easily write and compile Java code on a Windows or Mac computer, and then run it on a Linux computer without modification. Making sure code runs on multiple operating systems is called writing portable code. For most Java code, things will work without adaptation, but you may have to pay attention to things that can differ between systems, such as:

  • Don’t hard code absolute file paths that only exist on your local computer, such as C:\Users\Jane McDoe\mydata.csv, but always use relative paths, e.g. mydata.csv.
  • Avoid hard coding file separator symbols, i.e. / and \, as they differ between Windows and Mac/Linux. Use the property File.separator if you need this, as it will take a value depending on the operating system your program is run on, or look for a more stable way to construct file paths in the documentation.
  • Avoid calling system specific commands, such as Runtime.getRuntime().exec("ls"), as it will not work on a Windows system.
  • If your program uses compiled or native libraries, be sure that the native libraries are available on all operating systems you want to run your program on.

We will now consider how we can run this program from the command line.

Running a Java program

Since Java is a compiled language, the first step is to compile the program. This can be done with the command for the Java compiler, javac. Note that we only need to compile a program once. If it is compiled, we can run it as many times as we want. We need to pass the Java compiler the files we want to compile as arguments. Let’s do so for our program:

$ cd ~/examples/java
$ javac Divisors.java

If we do not get any errors, we can use ls to check if a Divisors.class file was generated. If so, this indicates the program was compiled successfully.

After it is compiled, we can actually run it.

$ java Divisors

If all goes well, we should see the output of our program!

The following numbers are divisible by all of the divisors
[15, 30, 45, 60, 75, 90]

However, it might be nice if we can make the program a bit more flexible. We will look at two ways to do so: using standard in, and via command line arguments.

Reading data typed into the terminal

When you learned programming, you may have written interactive programs where the program would ask you for your name, and then printed it back to you. This works fine from the command line: if we use a Scanner on System.in, we can read input typed by the user. Let us adjust the code a bit to let the user enter the bound and the divisors, by adding the following to the main method (this requires importing java.util.Scanner):

try (Scanner scan = new Scanner(System.in)) {
   System.out.println("What is the upper bound of numbers to be considered?");
   int upperBound = scan.nextInt();
   List<Integer> divisors = new ArrayList<>();

   boolean readMore = true;
   while (readMore) {
      System.out.println("Enter a divisor you want to consider (or -1 if you are done)");
      int div = scan.nextInt();
      if (div == -1) {
         readMore = false;
      }
      else {
         divisors.add(div);
      }
   }

   // Perform the computation here
} 

This version of the program is stored in the file DivisorsStdIn.java. Let’s compile and run it.

$ javac DivisorsStdIn.java
$ java DivisorsStdIn

When we run this, we get interactive prompts where we can specify the bound and the divisors, as follows:

What is the upper bound of numbers to be considered?
100
Enter a divisor you want to consider (or -1 if you are done)
3
Enter a divisor you want to consider (or -1 if you are done)
5
Enter a divisor you want to consider (or -1 if you are done)
-1
The following numbers are divisible by all of the divisors
[15, 30, 45, 60, 75, 90]

Bash actually has a handy feature we can use if we want to avoid manually typing in all the input ourselves. Rather than sending data from the keyboard to the standard input of the Java program, we can send the data from a text file instead. We can do so with the < operator on the command line. There is already a text file we can try this with: ~/examples/data1.txt, which has the following contents:

100
3
5
-1

Let us send this as input to our Java program using the following command:

$ java DivisorsStdIn < ~/examples/data1.txt

This gives us the following output:

What is the upper bound of numbers to be considered?
Enter a divisor you want to consider (or -1 if you are done)
Enter a divisor you want to consider (or -1 if you are done)
Enter a divisor you want to consider (or -1 if you are done)
The following numbers are divisible by all of the divisors
[15, 30, 45, 60, 75, 90]

Input redirection hides the input

When we redirect input from a file to our Java program, the data from the file is not shown in the printed output. If we type the data into the terminal ourselves, the text appears on the screen only because we typed it. This is why we do not see the numbers from ~/examples/data1.txt in our output.

In some cases, this can be more convenient than typing the input directly into the program. See what happens if you run the program with a different data file, ~/examples/data2.txt. Try editing the file and see if you can get a different output!

Reading command line arguments from our program

While reading input from a file passed via the command line does help to make our program more flexible, it can also be inconvenient that we must prepare a file with the things to send to the program.

When we worked with other terminal commands, such as ls, we saw that it was possible to pass specific arguments, such as -l, that adjusted the behavior of the program. We can do something similar with our own program.

Every main method in Java has an array parameter, declared like String [] args. You may have wondered what this array is used for. In fact, it contains any command line arguments passed to the Java program!

To test this, the following short Java program can be found under ~/examples/java/PrintArgs.java:

public class PrintArgs {
   public static void main(String [] args) {
      for (String arg: args) {
         System.out.println(arg);
      }
   }
}

If we first compile it with javac PrintArgs.java and then run java PrintArgs 100 3 5 we get the following output:

100
3
5

That seems to work! Let’s rewrite our program so that it works with these command line arguments rather than the standard input. For this we would add the following code to the main method:

int upperBound = Integer.parseInt(args[0]);
List<Integer> divisors = new ArrayList<>();
for (int i=1; i < args.length; i++) {
   divisors.add(Integer.parseInt(args[i]));
}

// The code with the computation goes here

This Java program is available as DivisorsArgs.java. Let us run it using

$ javac DivisorsArgs.java
$ java DivisorsArgs

Unfortunately, this gives an error:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
        at DivisorsArgs.main(DivisorsArgs.java:7)

The reason for this error is that our program expects there to be some arguments in the String [] args, but we forgot to pass them!

A list of strings and arguments with spaces

The arguments passed to our Java program will be accessible within the main method of our Java program as the String [] argument of this method. Since it is an array of String, we first need to convert the arguments to other data types if that is desirable (in the example we convert them to int using Integer.parseInt). The arguments are separated by spaces. Thus, if we run java PrintArgs hello there, we get two separate lines with hello and there. If you want to pass an argument that contains a space, you can surround it with quotation marks, i.e. java PrintArgs "hello there", which will then print hello there on a single line.

It can be handy to use strings in your programs. Such strings can be the names of files your program should read data from or write data to, a URL to retrieve data from, or some other non-numeric property.

Let’s fix this problem by adding some command line arguments:

$ java DivisorsArgs 100 3 5

Fortunately, this gives the correct output:

The following numbers are divisible by all of the divisors
[15, 30, 45, 60, 75, 90]

Feel free to play around with this! What happens if you pass different numbers?

Nicer command line arguments with a parser

Now that we have learned how we can use arguments passed on the command line within our Java program, there are a few things to consider. First, the error message we got when we forgot to pass arguments was not really helpful. Second, if we have many arguments, it can become a hassle to remember the fixed order in which we should provide them. Furthermore, we may not want to be forced to provide all arguments at the same time. For the sake of usability, it is often a good idea to have sensible defaults for the settings of your program, and to let the user only write command line arguments for the settings they desire to override. If we compare our Java program to the comprehensive help we get when we run ls --help, it is clear that there is still room for improvement.

While Java does not come with a library for parsing command line arguments out of the box, the Picocli library is a rather powerful library that helps us improve this rather easily. This library gives us the option to easily construct an argument parser, which is a program that can process command line arguments for us, and provide help and meaningful error messages in case of trouble. After the argument parser is constructed, we let it consume the command line arguments, and if the parser completes without raising an exception, it gives us easy access to the arguments we are interested in.

Setting up a Java library for your project can be a bit of a hassle, as both the compiler javac as well as the JVM runtime java need to be able to find it. A jar file containing version 4.6.1 of the picocli library can be found in ~/examples/java/picocli-4.6.1.jar. We will pass the argument -cp .:picocli-4.6.1.jar to make sure they are able to find it.

Alternative: use a maven project

If you work with many libraries, an alternative approach to passing the libraries via the -cp argument to the compiler and JVM, is to use a build tool to package everything in one big .jar file, and then run that .jar file directly. Maven is one of the most famous build tools for Java.

In the directory ~/examples/java/project you can find a maven project that is configured to automatically download and package picocli with your own code. The code can be found in ~/examples/java/project/main/java/examples/DivisorsArgParse.java and the maven project configuration file in ~/examples/java/project/pom.xml.

You can compile the project using maven as follows:

$ cd ~/examples/java/project
$ mvn package

This will create a single runnable jar file called ~/examples/java/project/target/divisors-argparse.jar. You can run this file as follows:

$ java -jar ~/examples/java/project/target/divisors-argparse.jar

The main advantage of this is that you do not have to specify the library every time. You can even copy the .jar packaged file to another computer and run it there. Similarly, if you export a runnable .jar file from Eclipse or IntelliJ on your local computer, you can use this command to run this file directly on Linux as well.

Once we make sure the library can be found by Java, using it is not that difficult. We do need to change the way our program is set up a little bit, as picocli will use the arguments it finds to modify instance variables of an object automatically, and will then call the run() method of the Runnable interface on that object. We can use a @Command annotation on the class to define a general help message, and @Option annotations on the instance variables to link the instance variables to command line arguments. Therefore, we define the class and its (protected) instance variables as follows:

@Command(description="Find numbers that are divisible by a list of divisors")
public class DivisorsArgParse implements Runnable {

   @Option(names={"--bound","-b"}, defaultValue="100", description="An upper bound on which numbers to check")
   protected int upperBound;

   @Option(names={"--divs","-d"}, required=true, arity="1..*", description="A list of divisors to check for")
   protected List<Integer> divisors;

   // The methods go here
   // ...
}

We then add the run method that will be executed after picocli manages to parse the command line arguments. This will run the old getDivisors() method we had, which we will also adjust to work with the instance variables rather than with arguments to the methods. This gives us:

@Override
public void run() {
   List<Integer> result = getDivisors();
   System.out.println("The following numbers are divisible by all of the divisors");
   System.out.println(result);
}

public List<Integer> getDivisors() {
   List<Integer> result = new ArrayList<>();
   for (int i=1; i < upperBound; i++) {
      boolean divisable = true;
      for (int div : divisors) {
         if (i%div != 0) {
            divisable = false;
         }
      }
      if (divisable) {
         result.add(i);
      }
   }
   return result;
}

Finally, we need to add a main method that uses the picocli class CommandLine to parse the command line arguments, configure a DivisorsArgParse object and then perform the computation. We can do so with the following code:

public static void main(String [] args) {
   // Create an empty DivisorsArgParse object
   DivisorsArgParse myObj = new DivisorsArgParse();
   // Create a picocli CommandLine object that will configure myObj and then pass it the arguments
   CommandLine cli = new CommandLine(myObj);
   cli.execute(args);
}

As you can see, setting up the parser requires only three annotations and three lines of code. Within the annotation we specify a number of properties of the argument: the names of the argument and a help message to display (the type is derived from the instance variable). Furthermore, we can define the default value of the argument if the user omits it, and we can also specify arguments that can have any number of values, such as the --divs argument in the example does with the arity="1..*" attribute. This tells the argument parser that multiple values can be provided after this argument, and the parser will then give a list of values rather than a single value.
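
Picocli also validates the values we pass, based on the types of the instance variables. If we pass something that cannot be converted, for example java -cp .:picocli-4.6.1.jar DivisorsArgParse --divs 3 x, picocli stops with an error message roughly of the form Invalid value for option '--divs': 'x' is not an int, followed by the usage information (the exact wording may differ between picocli versions).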

First, let’s compile our program and then run it:

$ javac -cp .:picocli-4.6.1.jar DivisorsArgParse.java
$ java -cp .:picocli-4.6.1.jar DivisorsArgParse

which prints

Missing required option: '--divs=<divisors>'
Usage: <main class> [-b=<upperBound>] -d=<divisors>... [-d=<divisors>...]...
Find numbers that are divisible by a list of divisors
  -b, --bound=<upperBound>   An upper bound on which numbers to check
  -d, --divs=<divisors>...   A list of divisors to check for

This is a nice and clear help message, that picocli automatically generated based on the annotations in the class. It detected that the required argument --divs is missing, and warns us about this. Now let’s try to run our class with a proper --divs argument.

$ java -cp .:picocli-4.6.1.jar DivisorsArgParse --divs 3 5

the output looks like we have come to expect. Notice that we did not provide the --bound argument, and that the default value 100 is still being used! If we want to know the results for a different bound, we can add it (either before or after the --divs argument).

$ java -cp .:picocli-4.6.1.jar DivisorsArgParse --divs 3 5 --bound 250

which finally prints

The following numbers are divisible by all of the divisors
[15, 30, 45, 60, 75, 90, 105, 120, 135, 150, 165, 180, 195, 210, 225, 240]

Modify the program

Add a third argument --forbid (and short name -f) that accepts a list of divisors that should not be a divisor of a number. This argument should be optional. We provided a copy of DivisorsArgParse.java called DivisorsArgParse2.java which updates the class name and the object creation in the main method. The goal is to adjust the code to include this new option correctly. You can consider making the --divs argument optional as well, but this is not required. Make sure that running

$ javac -cp .:picocli-4.6.1.jar DivisorsArgParse2.java
$ java -cp .:picocli-4.6.1.jar DivisorsArgParse2 --divs 3 5

produces the output

The following numbers adhere to the rules defined
[15, 30, 45, 60, 75, 90]

and that

$ java -cp .:picocli-4.6.1.jar DivisorsArgParse2 --divs 3 5 --forbid 20

produces the output

The following numbers adhere to the rules defined
[15, 30, 45, 75, 90]

Solution

First, it is necessary to add the new option to the class, for example:
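
// A possible way to add the new option (a sketch; the description text is an assumption)
@Option(names={"--forbid","-f"}, arity="1..*", description="A list of divisors a number may not have")
protected List<Integer> forbidden = new ArrayList<>();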

This is enough to make sure picocli sets this instance variable if the --forbid argument is passed. We also add an extra loop that sets divisable to false if i is divisible by a number in forbidden. The complete solution could look as follows:
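
// DivisorsArgParse2.java - a possible complete solution (a sketch; the file
// provided in the workshop may differ in details)
import java.util.List;
import java.util.ArrayList;
import picocli.CommandLine;
import picocli.CommandLine.Command;
import picocli.CommandLine.Option;

@Command(description="Find numbers that are divisible by a list of divisors")
public class DivisorsArgParse2 implements Runnable {

   @Option(names={"--bound","-b"}, defaultValue="100", description="An upper bound on which numbers to check")
   protected int upperBound;

   @Option(names={"--divs","-d"}, required=true, arity="1..*", description="A list of divisors to check for")
   protected List<Integer> divisors;

   @Option(names={"--forbid","-f"}, arity="1..*", description="A list of divisors a number may not have")
   protected List<Integer> forbidden = new ArrayList<>();

   @Override
   public void run() {
      List<Integer> result = getDivisors();
      System.out.println("The following numbers adhere to the rules defined");
      System.out.println(result);
   }

   public List<Integer> getDivisors() {
      List<Integer> result = new ArrayList<>();
      for (int i=1; i < upperBound; i++) {
         boolean divisable = true;
         for (int div : divisors) {
            if (i%div != 0) {
               divisable = false;
            }
         }
         // Extra loop: reject numbers that have a forbidden divisor
         for (int div : forbidden) {
            if (i%div == 0) {
               divisable = false;
            }
         }
         if (divisable) {
            result.add(i);
         }
      }
      return result;
   }

   public static void main(String [] args) {
      // Create an empty object, let picocli configure it and run the computation
      DivisorsArgParse2 myObj = new DivisorsArgParse2();
      CommandLine cli = new CommandLine(myObj);
      cli.execute(args);
   }
}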

Key Points

  • Your own Java programs can be run from the command line

  • There are different options to pass data into your program


Long running and scheduled programs

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • How do I prevent programs from stopping when I disconnect?

  • How can I schedule a program to be executed at certain times?

Objectives
  • Understand why programs are stopped once you disconnect

  • Use tmux to create a detached terminal that keeps running

  • Be able to check the current use of resources using htop

  • Be able to use cron to schedule programs to be run regularly

Example code

In this episode you will work with example code. During the workshop your instructor will have set up a user account that comes with the example code already provided. If you are not participating in the workshop, you can download an archive with all the files assumed to be present during this episode.

A computer running in the cloud can be a handy system to run a long running computation. Unfortunately, once your ssh client disconnects, the programs that are currently running in your shell session are terminated. In this episode we discuss how you can schedule programs to be run at a given time. This can be handy if, for example, you want to do some data scraping or monitoring tasks. Furthermore, we consider how you can safely disconnect from the remote computer while your computations are running.

Periodic execution of programs

For some tasks, such as scraping data from a website, it can be useful if the virtual machine runs them at certain times without you starting them. Unix systems have a well-known tool for this called cron, derived from the Greek word for time (Chronos). Every user has a so-called crontab, a table with times and commands for cron to execute. This is basically a text file where each line first has fields separated by spaces indicating when to run a given task, followed by the command to run. The five fields are as follows:

  1. Which minute (0 - 59) should the command run
  2. Which hour (0 - 23) should the command run
  3. Which day of the month (1 - 31) should the command run
  4. Which month (1 - 12) should the command run
  5. Which day of the week (0 - 6), where 0 is Sunday and 6 is Saturday, should the command run.

For each field, we can also use a wildcard * to indicate that the command should be executed at every moment, rather than at a particular moment. We can also list multiple times by using commas, so 1,2,3 would mean only at time units 1, 2 or 3. Finally, it is also possible to use something like */5 to indicate that intervals of 5 time units should be considered (of course, numbers other than 5 can be used).

Some examples of crontab lines

# Runs myCommand at 12:00 on the first day of each month
0 12 1 * * myCommand
# Runs myCommand at 13:15, 13:30, 13:45, 15:15, 15:30 and 15:45 every day
15,30,45 13,15 * * * myCommand
# Runs myCommand every five minutes
*/5 * * * * myCommand

Let’s experiment with this. If we run

$ date

this will print the current date and time. Using the command

$ date >> ~/dates.txt

the current date and time will be appended to the file ~/dates.txt. We will use this command to easily check if the crontab is working. We can edit our crontab by running

$ crontab -e

If you get asked which editor to use, it is easiest to choose nano. Then, at the bottom of the file add a new line:

*/3 * * * * date >> ~/dates.txt

Make sure you add a newline, so the last line in the file is empty! Save the file (with Ctrl+O) and exit nano (with Ctrl+X). If all goes well, you will see:

crontab: installing new crontab

Great! Now wait for some time and see if new data appears in the file ~/dates.txt, for example by running cat ~/dates.txt every once in a while.

If at some point in time you want to disable the scheduled task, you can rerun

$ crontab -e

and either add a # at the start of the line we created to make it a comment, or just remove it. Save the updated crontab to let the change take effect.

Crontab Generator

If you find it difficult to remember the syntax of the crontab, there are also websites that can help you generate the required line in a more user friendly way. For example, the site Crontab Generator can be used for this. This can make life a bit easier.

Working with long running tasks

Sometimes computational experiments take a long time to run. If you start an experiment on the command line, the running program is called a process. When you normally run a program from the shell, the created process will be a child of the shell process you are currently working in. If you disconnect from the server, this shell process will be terminated, and so will all its child processes. As a consequence, your long running experiments will be stopped once you disconnect.

To simulate the situation where we have a long running experiment, the program ~/examples/longtask will run for five minutes (but to keep the server usable for other workshop participants, it will not perform any computations). Feel free to try running it:

$ cd ~/examples
$ ./longtask

You will see that it keeps running. You can interrupt it by pressing Ctrl + C on your keyboard, or by disconnecting your ssh client (closing the window typically does the trick).

Let’s try to keep this longtask running, even if we disconnect. One way is to create a virtual terminal session that keeps running, even if we disconnect from it. One program that can provide us with that is tmux, which stands for terminal multiplexer.

Virtual Machines vs Computer Clusters

We will now learn how to use tmux to run long jobs. By using tmux, we run the computations directly on the computer we connect to. For our own virtual machine running in the cloud, this is a fine solution.

Scientific Super Computer Clusters, such as the SurfSARA clusters in the Netherlands work differently: they have a very large number of computers and work with a job queue to distribute the tasks over all the computers in the cluster. You connect via ssh to a computer from which you submit your jobs to the job queue, but you are not supposed to directly run your jobs on the computer you connect to. If you are using a super computer, you should probably check in the documentation how you can submit jobs to run.

Now let’s use tmux to create a persistent terminal session. A useful command to check if you have currently created any such sessions is

$ tmux ls

If no sessions are running, it will print something like:

no server running on /tmp/tmux-1000/default

Now let’s create a session by running:

$ tmux

If everything works out, we should see something that looks like the following figure.

New tmux session
A newly started tmux session

This is now a terminal session that also works with bash, but that will persist even if we disconnect from the server. Quite handy!

Sessions stay active only while the machine keeps running

Even though the tmux session keeps running when we disconnect our ssh client, events such as rebooting or shutting down the virtual machine can still terminate the sessions.

Now that we are inside the tmux session, we can switch back to our original shell session by running

$ tmux detach

which should take us back to our original shell session with the message

[detached (from session 0)]

This means there is still a tmux session named 0 running. Let’s see if we can find it by running

$ tmux ls

which would print something like

0: 1 windows (created Fri May 14 11:08:19 2021) [80x23]

Let’s try to get back into this session. We can do so with the command

$ tmux attach -t 0

Here the -t argument indicates the name of the session we want to return to. In this case, the name of that session is 0. Now that we have reconnected to this session, let’s try to run the longtask program in it.

$ ./longtask

Now we will see something similar to the following:

A tmux session running a long task
The tmux session running the long task

If we now disconnect our ssh client, we can later reconnect to the session running tmux. However, while the long task is running, we cannot type tmux detach. Fortunately, there is also a keyboard shortcut that can be used to detach from the session without typing that command. Keyboard commands for tmux always consist of two steps. First you press Ctrl + B at the same time (be careful not to press C accidentally!). After you press these keys together and release them, you can give tmux a command. Press D and you will return to your original shell, while the long task keeps running in your tmux session named 0. Feel free to check this by running tmux attach -t 0.

Once you are done with a tmux session, you can close it by typing

$ exit

The exit command closes the session

Be aware that running exit in your original ssh session will disconnect you!

Managing Multiple Sessions

If so desired, we can easily create multiple tmux sessions in which we run different computations. Every time we run tmux, a new session is created, with the first session being called 0, the second session 1, the third 2, etcetera. It may be helpful to give more memorable names to our sessions than 0 and 1. We may do so by typing

$ tmux new -s coolexperiment

which creates a new session with the name coolexperiment. Of course, you are free to use any name you like. Now when you want to attach to this session (after detaching) you can use

$ tmux attach -t coolexperiment

You can make multiple sessions in which you run multiple computations in parallel this way, and you can easily switch between them using tmux and the corresponding commands.

Sessions do not clean up automatically

One convenience offered by the fact that your regular shell stops as soon as you disconnect your ssh client, is that you do not have to worry that your system becomes cluttered with many different shell sessions it has to keep in memory. In a sense, it is an automatic cleanup facility.

When you use tmux sessions, these are not automatically cleaned up. That means you have to use exit within those sessions to end them yourself, or use tmux kill-session -t session_name to end a session. You can use tmux ls to check which sessions are currently running.
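
For example, cleaning up the session we created earlier could look as follows (the output of tmux ls assumes no other sessions are left):

$ tmux kill-session -t coolexperiment
$ tmux ls

no server running on /tmp/tmux-1000/default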

Once you know how to create new sessions using tmux and attach and detach from them, you know everything you need to run multiple computations in parallel on your cloud machine. However, tmux offers even more flexibility: you can arrange multiple sessions side by side on the same terminal, and there are also key combinations to quickly switch between them. You can do an internet search for some tmux tutorials, and if you forget things, it can be handy to have a tmux cheatsheet ready.

Checking resource usage

Now that you know how you can create multiple sessions to run all kinds of computations in parallel, there is also a danger that you overload the system. Depending on how many CPUs and how much memory your virtual machine has, and the amount of resources consumed by the applications you run, it can be the case that you manage to slow the system down so much that it becomes unresponsive. Even worse: your experiments may run slower when you overload the capacity by running too many experiments in parallel, compared to running them sequentially.

It is thus useful to be able to keep an eye on system resource consumption, similar to what you can do with the Task Manager and Resource Monitor on Windows, or with the Activity Monitor on Mac OS. There are a number of very similar utilities on Unix systems, the most user friendly one being htop. Let’s try to run it:

$ htop

It will show you something like the following:

The htop process manager
The htop process manager showing a list of running processes.

At the top you can see which percentage of the CPU is used; in the example this is a very low percentage. Underneath you see the memory usage. Furthermore, you see a list of processes. Interestingly, htop supports the mouse. If you click on MEM%, you will see that the processes are sorted by memory consumption, whereas clicking on CPU% will sort them by CPU consumption. If you click F10 Quit you will close the program (alternatively, you can press F10 on the keyboard).

Note that if a computer has multiple cores/CPUs, Unix typically counts each CPU as 100%. So on a four core system, 200% CPU usage indicates that half of the cores are being used. Helpfully, htop also clearly shows multiple CPUs if they exist. Be aware though that on some machines, hyperthreading may show twice as many cores as are physically available. Read the specification of the machine you are using if you are unsure!

Finally, you can also click on a process (htop will highlight the line you click on) and force that program to stop by pressing F9 or clicking F9 Kill at the bottom. Note that unless you run htop as a super user, you will only be able to stop your own processes this way, not those started by other users.

Alternatives to htop

It is possible that the virtual machine you use yourself does not have htop. In such a case, one option could be to try and install it (sudo apt install htop on Ubuntu/Debian based systems). A utility that is more commonly available, but less user friendly, is top. It displays the same things but has fewer options for sorting and does not provide as nice a summary of resource usage as htop does. Even more basic is the command ps aux, which just prints a list of the processes currently running with their resource utilization.

Key Points

  • Periodic execution of jobs can be done using cron by editing your crontab

  • You can create persistent terminal sessions using tmux, that keep running even if you disconnect

  • With htop you can check on resource consumption of the running processes


System Management and Installation

Overview

Teaching: 5 min
Exercises: 5 min
Questions
  • How do I manage a Virtual Machine?

  • How can I install missing software on a Linux Virtual Machine?

Objectives
  • Know the difference between a regular user and a super user

  • Understand the concept of a package manager

  • Search for some packages with apt search

Super Users and the sudo command

If you use Mac OS or Windows, you may have noticed that sometimes when you install software or change something on your computer, you have to confirm that you want to do so with an additional security prompt (on Mac OS you have to type in your password). The sudo command is the same idea. If you are trying to make some changes to your system and get the error that your account does not have enough privileges, you should try rerunning the command with sudo in front of it to run it with administrator privileges.

It is a very bad idea to work with administrator privileges by default: if you make a mistake, such as typing in an incorrect rm or mv command, you may break the system entirely. It is therefore a good idea to only use administrator privileges when you really need them. Furthermore, if you create files as an administrator, you may not be able to access them as a regular user without changing the ownership and/or permissions of those files, which can be cumbersome.

Installing packages

Even as a regular user, you can use the package manager to search for packages that you may want to install. For example, suppose you are interested in using Octave, which is an Open Source package that is very similar to MATLAB and is even able to run many basic MATLAB scripts.

If we try to type

$ octave

we will get something like the following error:

Command 'octave' not found, but can be installed with:

sudo apt install octave

As we can see, Ubuntu recognized that there is a package that would provide the octave software, but that it is currently not installed. It suggests installing it with the apt install command, preceded by sudo, as installing software requires elevated system permissions.

The command apt is a package manager for Debian and Ubuntu based Linux distributions. A package manager is somewhat similar to an app store application you may know from your smart phone. It allows you to easily install software you want.

Even if you do not have administrator rights on a system, you can still use apt to search which packages are there. Let’s try and search for the octave package:

$ apt search octave

Since this shows a lot of packages that all contain the word octave, you can consider piping the output through less to be able to browse it more easily:

$ apt search octave | less

You can close the less tool by pressing the Q key.

As you can see, the main package for Octave is called octave. If you had super-user rights on the virtual machine (you do not during the workshop, but you would on a Debian or Ubuntu virtual machine you created yourself), you could install the package by running

$ sudo apt install octave

Sometimes the software you want is not available in the standard package repository, or the version in the standard repository is too old for your needs. In some cases, the developers of the software make additional repositories available which you can use. This is for example the case for R. To be able to install it, you first tell the apt package manager to add an additional repository and the corresponding signing key (this is not possible during the workshop):

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
$ sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/" 

After that, you should be able to install a (modern) version of R using the package manager:

$ sudo apt install r-base

The general advice is to look on the website of the programming environment you want to use to see how it can be installed in a Linux context. Often multiple options are given; the option that uses a package manager is usually easier than alternatives such as compiling and installing the software from source code.

Commercial packages and Binary Software

In case you want to use commercial software, such as MATLAB or CPLEX, this is typically not available via the package manager. First, the software must support Linux, and then you need to obtain a Linux-specific installer. For shared libraries, you also need a different type of library: a Windows dll file does not work on Linux, as the binary format is different. Look at the documentation of the software you want to use. Note that this typically only holds true for binary applications. Interpreted languages (such as Python and R) and bytecode-based platforms (such as Java or C#) are generally easier to run on different operating systems, but some care is needed if they make use of binary/compiled components.
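If you are unsure whether a given program is a Linux binary at all, the file utility can tell you; typical output looks something like this (details vary per system):

$ file /usr/bin/ls
/usr/bin/ls: ELF 64-bit LSB executable, x86-64, dynamically linked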

Key Points

  • Have a general idea how additional software can be installed on a Linux machine


(BONUS) Wildcards and pipes

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can I run a command on multiple files at once?

  • Is there an easy way of saving a command’s output?

Objectives
  • Redirect a command’s output to a file.

  • Process a file instead of keyboard input using redirection.

  • Construct command pipelines with two or more stages.

  • Explain what usually happens if a program or pipeline isn’t given any input to process.

In this lesson and the next, we are going to look at a large and complex file type used in bioinformatics: a .gtf file. The GTF2 format is commonly used to describe the location of genetic features in a genome.

Let’s grab and unpack a set of demo files for use later. To do this, we’ll use wget (wget link downloads the file found at link).

$ wget https://pcbouman-eur.github.io/workshop-getting-started-cloud/files/bioinformatics.tar.gz

You’ll commonly encounter .tar.gz archives while working in UNIX. To extract the files from a .tar.gz file, we run the command tar -xvf filename.tar.gz (x to extract, v for verbose output, f to indicate the file to operate on):

$ tar -xvf bioinformatics.tar.gz
dmel-all-r6.19.gtf
dmel_unique_protein_isoforms_fb_2016_01.tsv
gene_association.fb
SRR307023_1.fastq
SRR307023_2.fastq
SRR307024_1.fastq
SRR307024_2.fastq
SRR307025_1.fastq
SRR307025_2.fastq
SRR307026_1.fastq
SRR307026_2.fastq
SRR307027_1.fastq
SRR307027_2.fastq
SRR307028_1.fastq
SRR307028_2.fastq
SRR307029_1.fastq
SRR307029_2.fastq
SRR307030_1.fastq
SRR307030_2.fastq

Unzipping files

We just unzipped a .tar.gz file for this example. What if we run into other file formats that we need to unzip? Just use the handy reference below:

  • gunzip extracts the contents of .gz files
  • unzip extracts the contents of .zip files
  • tar -xvf extracts the contents of .tar.gz and .tar.bz2 files
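For instance, with hypothetical archive names:

$ gunzip data.gz            # produces data
$ unzip archive.zip         # extracts the archive's contents here
$ tar -xvf bundle.tar.bz2   # handles .tar.bz2 the same way as .tar.gz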

That is a lot of files! One of these files, dmel-all-r6.19.gtf, is extremely large, and contains every annotated feature in the Drosophila melanogaster genome. It’s a huge file: what happens if we run cat on it? (Press Ctrl + C to stop it.)

Now that we know some of the basic UNIX commands, we are going to explore some more advanced features. The first of these features is the wildcard *. In our examples before, we’ve done things to files one at a time and otherwise had to specify things explicitly. The * character lets us speed things up and do things across multiple files.

Ever wanted to move, delete, or just do “something” to all files of a certain type in a directory? * lets you do that, by taking the place of zero or more characters in a piece of text. So *.txt would be equivalent to all .txt files in a directory, for instance. * by itself matches all files. Let’s use our example data to see what I mean.

$ tar xvf bioinformatics.tar.gz
$ ls
bioinformatics.tar.gz                        SRR307026_1.fastq
dmel-all-r6.19.gtf                           SRR307026_2.fastq
dmel_unique_protein_isoforms_fb_2016_01.tsv  SRR307027_1.fastq
gene_association.fb                          SRR307027_2.fastq
SRR307023_1.fastq                            SRR307028_1.fastq
SRR307023_2.fastq                            SRR307028_2.fastq
SRR307024_1.fastq                            SRR307029_1.fastq
SRR307024_2.fastq                            SRR307029_2.fastq
SRR307025_1.fastq                            SRR307030_1.fastq
SRR307025_2.fastq                            SRR307030_2.fastq

Now we have a whole bunch of example files in our directory. For this example we are going to learn a new command that tells us how long a file is: wc. The abbreviation wc stands for word count, although it can also be used to count lines, characters or bytes. wc -l file tells us the number of lines in a text file.

$ wc -l dmel-all-r6.19.gtf
542048 dmel-all-r6.19.gtf

Interesting, there are over 540000 lines in our dmel-all-r6.19.gtf file. What if we wanted to run wc -l on every .fastq file? This is where * comes in really handy! *.fastq would match every file ending in .fastq.

$ wc -l *.fastq
20000 SRR307023_1.fastq
20000 SRR307023_2.fastq
20000 SRR307024_1.fastq
20000 SRR307024_2.fastq
20000 SRR307025_1.fastq
20000 SRR307025_2.fastq
20000 SRR307026_1.fastq
20000 SRR307026_2.fastq
20000 SRR307027_1.fastq
20000 SRR307027_2.fastq
20000 SRR307028_1.fastq
20000 SRR307028_2.fastq
20000 SRR307029_1.fastq
20000 SRR307029_2.fastq
20000 SRR307030_1.fastq
20000 SRR307030_2.fastq
320000 total

That was easy. What if we wanted to do the same command, except on every file in the directory? A nice trick to keep in mind is that * by itself matches every file.

$ wc -l *
    53037 bioinformatics.tar.gz
   542048 dmel-all-r6.19.gtf
    22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
   106290 gene_association.fb
    20000 SRR307023_1.fastq
    20000 SRR307023_2.fastq
    20000 SRR307024_1.fastq
    20000 SRR307024_2.fastq
    20000 SRR307025_1.fastq
    20000 SRR307025_2.fastq
    20000 SRR307026_1.fastq
    20000 SRR307026_2.fastq
    20000 SRR307027_1.fastq
    20000 SRR307027_2.fastq
    20000 SRR307028_1.fastq
    20000 SRR307028_2.fastq
    20000 SRR307029_1.fastq
    20000 SRR307029_2.fastq
    20000 SRR307030_1.fastq
    20000 SRR307030_2.fastq
  1043504 total

Multiple wildcards

You can even use multiple *s at a time. How would you run wc -l on every file with “fb” in it?

Solution

$ wc -l *fb*

i.e. anything or nothing, then fb, then anything or nothing.

Using other commands

Now let’s try cleaning up our working directory a bit. Create a folder called “fastq” and move all of our .fastq files there in one mv command.

Solution

$ mkdir fastq
$ mv *.fastq fastq/

Redirecting output

Each of the commands we’ve used so far does only a very small amount of work. However, we can chain these small UNIX commands together to perform otherwise complicated actions!

For our first foray into redirecting output, we are going to use the > operator to write output to a file. When using >, the output of the command on the left of the > is written to the file whose name you specify on the right. The actual syntax looks like command > filename.

Let’s try several basic usages of >. echo simply prints back, or echoes whatever you type after it.

$ echo "this is a test"
$ echo "this is a test" > test.txt
$ ls
$ cat test.txt
this is a test

bash-lesson.tar.gz                           fastq
dmel-all-r6.19.gtf                           gene_association.fb
dmel_unique_protein_isoforms_fb_2016_01.tsv  test.txt

this is a test

Awesome, let’s try that with a more complicated command, like wc -l.

$ wc -l * > word_counts.txt
wc: fastq: Is a directory
$ cat word_counts.txt
    53037 bioinformatics.tar.gz
   542048 dmel-all-r6.19.gtf
    22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
        0 fastq
   106290 gene_association.fb
        1 test.txt
   723505 total

Notice how we still got some output to the console even though we redirected the output to a file? Our expected output went to the file, but how did the error message avoid the file and end up on the screen?

This phenomenon is an artefact of how UNIX systems are built. There are 3 input/output streams for every UNIX program you will run: stdin, stdout, and stderr.

Let’s dissect these three streams of input/output in the command we just ran, wc -l * > word_counts.txt:

  • stdin is the standard input stream, normally the keyboard; wc did not need it here, since it read directly from the files given as arguments.
  • stdout is the standard output stream, where the line counts were written; the > operator redirected this stream into word_counts.txt.
  • stderr is a separate stream reserved for error messages; wc: fastq: Is a directory was written there, and because > only redirects stdout, it still ended up on the console.
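As an aside (we will not need it in this lesson), stderr can also be redirected separately with 2>, so that normal output and error messages end up in different files:

$ wc -l * > word_counts.txt 2> errors.txt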

Knowing what we know now, let’s try re-running the command and sending all of the output (including the error message) to the same word_counts.txt file as before.

$ wc -l * &> word_counts.txt

Notice how there was no output to the console that time. Let’s check that the error message went to the file like we specified.

$ cat word_counts.txt
    53037 bioinformatics.tar.gz
   542048 dmel-all-r6.19.gtf
    22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
wc: fastq: Is a directory
        0 fastq
   106290 gene_association.fb
        1 test.txt
        7 word_counts.txt
   723512 total

Success! The wc: fastq: Is a directory error message was written to the file. Also, note how the file was silently overwritten by directing output to the same place as before. Sometimes this is not the behaviour we want. How do we append (add) to a file instead of overwriting it?

Appending to a file is done the same way as redirecting output. However, instead of >, we use >>.

$ echo "We want to add this sentence to the end of our file" >> word_counts.txt
$ cat word_counts.txt
    53037 bioinformatics.tar.gz
   542048 dmel-all-r6.19.gtf
    22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
wc: fastq: Is a directory
        0 fastq
   106290 gene_association.fb
        1 test.txt
        7 word_counts.txt
   723512 total
We want to add this sentence to the end of our file

Chaining commands together

We now know how to redirect stdout and stderr to files. We can actually take this a step further and redirect output (stdout) from one command to serve as the input (stdin) for the next. To do this, we use the | (pipe) operator.

grep is an extremely useful command. It finds things for us within files. Basic usage (there are a lot of options for more clever things, see the man page) uses the syntax grep whatToFind fileToSearch. Let’s use grep to find all of the entries pertaining to the Act5C gene in Drosophila melanogaster.

$ grep Act5C dmel-all-r6.19.gtf

The output is nearly unintelligible since there is so much of it. Let’s send the output of that grep command to head so we can just take a peek at the first line. The | operator lets us send output from one command to the next:

$ grep Act5C dmel-all-r6.19.gtf | head -n 1
X	FlyBase	gene	5900861	5905399	.	+	.	gene_id "FBgn0000042"; gene_symbol "Act5C";

Nice work, we sent the output of grep to head. Let’s try counting the number of entries for Act5C with wc -l. We can do the same trick to send grep’s output to wc -l:

$ grep Act5C dmel-all-r6.19.gtf | wc -l
46

Note that this achieves the same result as redirecting the output to a file and then counting the lines of that file, just without the intermediate file.
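For comparison, the equivalent two-step version, using a hypothetical intermediate file act5c.txt:

$ grep Act5C dmel-all-r6.19.gtf > act5c.txt
$ wc -l act5c.txt
46 act5c.txt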

Writing commands using pipes

How many files are there in the “fastq” directory we made earlier? (Use the shell to do this.)

Solution

$ ls fastq | wc -l
16

When chained into another command like this, ls prints one item per line, so counting the lines gives the number of files.

Reading from compressed files

Let’s compress one of our files using gzip.

$ gzip gene_association.fb

zcat acts like cat, except that it can read information from .gz (compressed) files. Using zcat, can you write a command to take a look at the top few lines of the gene_association.fb.gz file (without decompressing the file itself)?

Solution

$ zcat gene_association.fb.gz | head

The head command without any options shows the first 10 lines of a file.

Key Points

  • The * wildcard is used as a placeholder to match any text that follows a pattern.

  • Redirect a command’s output to a file with >.

  • Commands can be chained together with |.

  • Commands such as wc, grep, head, and zcat let you count, search, and inspect the contents of files.


(BONUS) Scripts, variables, and loops

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How do I turn a set of commands into a program?

Objectives
  • Write a shell script

  • Understand and manipulate UNIX permissions

  • Understand shell variables and how to use them

  • Write a simple “for” loop.

We now know a lot of UNIX commands! Wouldn’t it be great if we could save certain commands so that we could run them later or not have to type them out again? As it turns out, this is straightforward to do. A “shell script” is essentially a text file containing a list of UNIX commands to be executed in a sequential manner. These shell scripts can be run whenever we want, and are a great way to automate our work.

Writing a Script

So how do we write a shell script, exactly? It turns out we can do this with a text editor. Start editing a file called “demo.sh” (to recap, we can do this with nano demo.sh). The “.sh” is the standard file extension for shell scripts that most people use (you may also see “.bash” used).

Our shell script will have two parts:

  • On the very first line, a special line called a shebang (the #! characters), which tells the system which program should be used to run the script, in this case bash.
  • The commands we want to run.

Our file should now look like this:

#!/usr/bin/env bash

echo "Our script worked!"

Ready to run our program? Let’s try running it:

$ demo.sh 
bash: demo.sh: command not found...

Strangely enough, Bash can’t find our script. As it turns out, Bash only looks in certain directories (those listed in the PATH environment variable) for programs to run. To run anything else, we need to tell Bash exactly where to look by giving the path to the file. We can do this one of two ways: either with the absolute path /home/yourUserName/demo.sh, or with the relative path ./demo.sh.
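You can see exactly which directories Bash searches by printing the PATH variable; the list below is just an illustration, yours will differ:

$ echo $PATH
/usr/local/bin:/usr/bin:/bin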

$ ./demo.sh
bash: ./demo.sh: Permission denied

There’s one last thing we need to do. Before a file can be run, it needs “permission” to run. Let’s look at our file’s permissions with ls -l:

$ ls -l
-rw-rw-r-- 1 yourUsername tc001 12534006 Jan 16 18:50 bioinformatics.tar.gz
-rw-rw-r-- 1 yourUsername tc001       40 Jan 16 19:41 demo.sh
-rw-rw-r-- 1 yourUsername tc001 77426528 Jan 16 18:50 dmel-all-r6.19.gtf
-rw-r--r-- 1 yourUsername tc001   721242 Jan 25  2016 dmel_unique_protein_is...
drwxrwxr-x 2 yourUsername tc001     4096 Jan 16 19:16 fastq
-rw-r--r-- 1 yourUsername tc001  1830516 Jan 25  2016 gene_association.fb.gz
-rw-rw-r-- 1 yourUsername tc001       15 Jan 16 19:17 test.txt
-rw-rw-r-- 1 yourUsername tc001      245 Jan 16 19:24 word_counts.txt

That’s a huge amount of output: a full listing of everything in the directory. Let’s see if we can understand what each field of a given row represents, working left to right.

  1. Permissions: On the very left side, there is a string of the characters d, r, w, x, and -. The d indicates if something is a directory (there is a - in that spot if it is not a directory). The other r, w, x bits indicate permission to Read, Write, and eXecute a file. There are three fields of rwx permissions following the spot for d. If a user is missing a permission to do something, it’s indicated by a -.
    • The first set of rwx are the permissions that the owner has (in this case the owner is yourUsername).
    • The second set of rwxs are permissions that other members of the owner’s group share (in this case, the group is named tc001).
    • The third set of rwxs are permissions that anyone else with access to this computer can do with a file. Though files are typically created with read permissions for everyone, the permissions on your home directory usually prevent others from being able to access the file in the first place.
  2. References (not important) : This counts the number of references (hard links) to the item (file, folder, symbolic link or “shortcut”).
  3. Owner: This is the username of the user who owns the file. Their permissions are indicated in the first permissions field.
  4. Group: This is the user group of the user who owns the file. Members of this user group have permissions indicated in the second permissions field.
  5. Size of item: This is the number of bytes in a file, or the number of filesystem blocks occupied by the contents of a folder. (We can use the -h option here to get a human-readable file size in megabytes, gigabytes, etc.; see the example after this list.)
  6. Time last modified: This is the last time the file was modified.
  7. Filename: This is the filename.
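For instance, with the -h option the size of the large GTF file is shown in megabytes (the sizes shown are illustrative):

$ ls -lh dmel-all-r6.19.gtf
-rw-rw-r-- 1 yourUsername tc001 74M Jan 16 18:50 dmel-all-r6.19.gtf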

So how do we change permissions? As I mentioned earlier, we need permission to execute our script. Changing permissions is done with chmod. To add executable permissions for all users we could use this:

$ chmod +x demo.sh
$ ls -l
-rw-rw-r-- 1 yourUsername tc001 12534006 Jan 16 18:50 bioinformatics.tar.gz
-rwxrwxr-x 1 yourUsername tc001       40 Jan 16 19:41 demo.sh
-rw-rw-r-- 1 yourUsername tc001 77426528 Jan 16 18:50 dmel-all-r6.19.gtf
-rw-r--r-- 1 yourUsername tc001   721242 Jan 25  2016 dmel_unique_protein_is...
drwxrwxr-x 2 yourUsername tc001     4096 Jan 16 19:16 fastq
-rw-r--r-- 1 yourUsername tc001  1830516 Jan 25  2016 gene_association.fb.gz
-rw-rw-r-- 1 yourUsername tc001       15 Jan 16 19:17 test.txt
-rw-rw-r-- 1 yourUsername tc001      245 Jan 16 19:24 word_counts.txt

Now that we have executable permissions for that file, we can run it.

$ ./demo.sh
Our script worked!

Fantastic, we’ve written our first program! Before we go any further, let’s learn how to take notes inside our program using comments. A comment is indicated by the # character, followed by whatever we want. Comments do not get run. Let’s try out some comments in the console, then add one to our script!

# This won't show anything.

Now let’s try adding this to our script with nano. Edit your script to look something like this:

#!/usr/bin/env bash

# This is a comment... they are nice for making notes!
echo "Our script worked!"

When we run our script, the output should be unchanged from before!

Shell variables

One important concept that we’ll need to cover is shell variables. Variables are a great way of saving information under a name you can access later. In programming languages like Python and R, variables can store pretty much anything you can think of. In the shell, they usually just store text. The best way to understand how they work is to see them in action.

To set a variable, simply type in a name containing only letters, numbers, and underscores, followed by an = and whatever you want to put in the variable. Shell variable names are often uppercase by convention (but do not have to be).

$ VAR="This is our variable"

To use a variable, prefix its name with a $ sign. Note that if we want to simply check what a variable is, we should use echo (or else the shell will try to run the contents of a variable).

$ echo $VAR
This is our variable

Let’s try setting a variable in our script and then recalling its value as part of a command. We’re going to make our script run wc -l on whichever file we specify in the FILE variable.

Our script:

#!/usr/bin/env bash

# set our variable to the name of our GTF file
FILE=dmel-all-r6.19.gtf

# call wc -l on our file
wc -l $FILE

$ ./demo.sh
542048 dmel-all-r6.19.gtf

What if we wanted to do our little wc -l script on other files without having to change $FILE every time we want to use it? There is actually a special shell variable we can use in scripts that allows us to use arguments in our scripts (arguments are extra information that we can pass to our script, like the -l in wc -l).

To use the first argument to a script, use $1 (the second argument is $2, and so on). Let’s change our script to run wc -l on $1 instead of $FILE. Note that we can also pass all of the arguments using $@ (not going to use it in this lesson, but it’s something to be aware of).

Our script:

#!/usr/bin/env bash

# call wc -l on our first argument
wc -l $1

$ ./demo.sh dmel_unique_protein_isoforms_fb_2016_01.tsv
22129 dmel_unique_protein_isoforms_fb_2016_01.tsv

Nice! One thing to be aware of when using variables: they are all treated as pure text. How do we save the output of an actual command like ls -l?

A demonstration of what doesn’t work:

$ TEST=ls -l
-bash: -l: command not found

What does work (we need to surround any command with $(command)):

$ TEST=$(ls -l)
$ echo $TEST
total 90372 -rw-rw-r-- 1 jeff jeff 12534006 Jan 16 18:50 bioinformatics.tar.gz -rwxrwxr-x. 1 jeff jeff 40 Jan 16 19:41 demo.sh -rw-rw-r-- 1 jeff jeff 77426528 Jan 16 18:50 dmel-all-r6.19.gtf -rw-r--r-- 1 jeff jeff 721242 Jan 25 2016 dmel_unique_protein_isoforms_fb_2016_01.tsv drwxrwxr-x. 2 jeff jeff 4096 Jan 16 19:16 fastq -rw-r--r-- 1 jeff jeff 1830516 Jan 25 2016 gene_association.fb.gz -rw-rw-r-- 1 jeff jeff 15 Jan 16 19:17 test.txt -rw-rw-r-- 1 jeff jeff 245 Jan 16 19:24 word_counts.txt

Note that everything got printed on the same line. This is a feature, not a bug, as it allows us to use $(commands) inside lines of script without triggering line breaks (which would end our line of code and execute it prematurely). If you use echo "$TEST" (note the quotation marks), the line breaks will be printed.
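A quick sketch of the difference; with the quotation marks, the saved line breaks survive:

$ echo "$TEST" | head -n 2
total 90372
-rw-rw-r-- 1 jeff jeff 12534006 Jan 16 18:50 bioinformatics.tar.gz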

Loops

To end our lesson on scripts, we are going to learn how to write a for-loop to execute a lot of commands at once. This will let us do the same string of commands on every file in a directory (or other stuff of that nature).

for-loops generally have the following syntax:

#!/usr/bin/env bash

for VAR in first second third
do
    echo $VAR
done

When a for-loop gets run, the loop will run once for everything following the word in. In each iteration, the variable $VAR is set to a particular value for that iteration. In this case it will be set to first during the first iteration, second on the second, and so on. During each iteration, the code between do and done is performed.

Let’s run the script we just wrote (I saved mine as loop.sh).

$ chmod +x loop.sh
$ ./loop.sh
first
second
third

What if we wanted to loop over the contents of a shell variable, such as a list of every file in the current directory? Shell variables work perfectly well in for-loops. In this example, we’ll save the output of ls and loop over each file name it contains:

#!/usr/bin/env bash

FILES=$(ls)
for VAR in $FILES
do
    echo $VAR
done

$ ./loop.sh
bioinformatics.tar.gz
demo.sh
dmel_unique_protein_isoforms_fb_2016_01.tsv
dmel-all-r6.19.gtf
fastq
gene_association.fb.gz
loop.sh
test.txt
word_counts.txt

There’s a shortcut to run on all files of a particular type, say all .gz files:

#!/usr/bin/env bash

for VAR in *.gz
do
    echo $VAR
done

$ ./loop.sh
bioinformatics.tar.gz
gene_association.fb.gz

Writing our own scripts and loops

cd to our fastq directory from earlier and write a loop to print off the name and top 4 lines of every fastq file in that directory.

Is there a way to only run the loop on fastq files ending in _1.fastq?

Solution

Create the following script in a file called head_all.sh:

#!/usr/bin/env bash

for FILE in *.fastq
do
    echo $FILE
    head -n 4 $FILE
done

The “for” line could be modified to be for FILE in *_1.fastq to achieve the second aim.

Concatenating variables

Concatenating (i.e. mashing together) variables is quite easy to do. Enclose the variable name in {} characters (as in ${FILE}) and add whatever you want to concatenate directly before or after it.

$ FILE=stuff.txt
$ echo ${FILE}.example
stuff.txt.example

Can you write a script that prints off the name of every file in a directory with “.processed” added to it?

Solution

Create the following script in a file called process.sh:

#!/usr/bin/env bash

for FILE in *
do
    echo ${FILE}.processed
done

Note that this will also print directories appended with “.processed”. To truly only get files and not directories, we need to modify the loop to use the find command to give us only the files in the current directory:

#!/usr/bin/env bash

for FILE in $(find . -maxdepth 1 -type f)
do
    echo ${FILE}.processed
done

but this will have the side-effect of listing hidden files too.

Special permissions

What if we want to give different sets of users different permissions? chmod also accepts special numeric codes instead of letters like chmod +x. The numeric codes are as follows: read = 4, write = 2, execute = 1. For each class of user, we assign a digit based on the sum of these permissions (so each digit is between 0 and 7).

Let’s make an example file and give everyone permission to do everything with it.

$ touch example
$ ls -l example
$ chmod 777 example
$ ls -l example
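For instance (a combination chosen just for illustration): full access for the owner (4 + 2 + 1 = 7), read and execute for the group (4 + 1 = 5), and read only for everyone else (4) would be:

$ chmod 754 example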

How might we give ourselves permission to do everything with a file, while allowing no one else to do anything with it?

Solution

We want all permissions for ourselves, so: 4 (read) + 2 (write) + 1 (execute) = 7 for the user (first digit), and no permissions, i.e. 0, for the group (second digit) and everyone else (third digit):

$ chmod 700 example

Key Points

  • A shell script is just a list of bash commands in a text file.

  • To make a shell script file executable, run chmod +x script.sh.