Flevo CFD

Concurrent Processing with xargs and parallel

You might have encountered situations where you needed to execute an operation or a program on multiple files.

The first potential issue is that the program may accept only one file at a time. For example, when renaming, mv takes exactly one source and one destination.

Now, assuming our program accepts a list of files, the second potential issue is that it might process the files sequentially rather than concurrently, as tar does when it adds files to an archive one after another.

To address the above issues, we can use programs that take a list of inputs and perform our desired operations on them concurrently.

In this post, I want to introduce two excellent programs in this regard.

xargs

Now, for example, let’s say we want to find files with the extension txt and append .log to each of them.

The first step is to use the find command to locate our desired files:

find . -type f -name '*.txt'

The output of this command is a list of files. Now we send this list to xargs through the pipe operator |, and xargs turns it into arguments for the command we want to run.
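A minimal sketch of that pipe, run in a throwaway directory so it is safe to try. The file names are made up for illustration, and echo stands in for the real command so nothing is modified.

```shell
# Work in a scratch directory so the demo cannot touch real files.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch a.txt b.txt notes.md

# find prints one path per line; xargs appends them as arguments to echo.
# sort makes the order predictable, since find does not guarantee one.
listed=$(find . -type f -name '*.txt' | sort | xargs echo)
echo "$listed"
```

Both .txt paths end up on a single echo command line, which shows that xargs packs as many arguments as it can into one invocation unless told otherwise.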

By default, when xargs receives input from stdin (through |), it uses two rules to split that input into arguments:

  • Any input separated by a space from the next input is considered one argument.
  • Each input on a new line is considered one argument.

A file name that contains spaces is therefore split into several arguments, which causes issues. To solve this problem, we add the -print0 switch to find, which separates its output with null characters. Then we pass the -0 switch to xargs so that it splits arguments on null characters only, and the -n 1 switch to process one input at a time.
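The difference is easy to see with a single file whose name contains a space. This is a small sketch in a scratch directory; the file name is hypothetical, and echo is used only to count how many arguments xargs produces.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch 'my report.txt'

# Default splitting: the space breaks the name into two arguments.
split_count=$(find . -type f -name '*.txt' | xargs -n 1 echo | wc -l | tr -d ' ')

# Null-separated: the whole path arrives as a single argument.
null_count=$(find . -type f -name '*.txt' -print0 | xargs -0 -n 1 echo | wc -l | tr -d ' ')

echo "default: $split_count argument(s), null-separated: $null_count argument(s)"
```

With default splitting the one file is seen as two arguments, while the -print0 and -0 pair delivers it as one.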

Important switches

-print0 (find) Output is separated by null characters.
-0 (xargs) Arguments are separated by null characters.
-n (xargs) Maximum number of arguments passed to each command invocation.
-P (xargs) Maximum number of parallel processes (default is 1).
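To see what -n and -P do in isolation, here is a small sketch that feeds xargs a few made-up items with printf instead of find. The timing part assumes a machine that is not heavily loaded.

```shell
# -n 2 hands at most two items to each command invocation.
batches=$(printf '%s\0' a b c d e | xargs -0 -n 2 echo)
echo "$batches"

# -P 4 runs up to four invocations at once, so four one-second
# sleeps finish in roughly one second instead of four.
start=$(date +%s)
printf '%s\0' 1 1 1 1 | xargs -0 -n 1 -P 4 sleep
elapsed=$(( $(date +%s) - start ))
echo "elapsed: ${elapsed}s"
```

The first command prints three lines (two items, two items, one item). The second shows why -P matters: the work is the same, only the waiting overlaps.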

Example

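find . -type f -name '*.txt' -print0 | xargs -0 -I {} -P 32 mv {} {}.log

Here, -I {} tells xargs to run mv once per input and to replace every {} in the command with that input. Without -I, xargs does not substitute {} at all; it simply appends the inputs to the end of the command. The -n 1 switch is left out because -I already processes one input at a time, and GNU xargs treats -n and -I as mutually exclusive. The placeholder is arbitrary: if you want a different one, you can pass it to -I, for example -I%.

Example with -I%

find . -type f -name '*.txt' -print0 | xargs -0 -I% -P 32 mv % %.log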
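To try the rename without risking real files, the sketch below runs it in a scratch directory. The file names are invented for the demo, and -P 4 is an arbitrary choice.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch one.txt 'two words.txt' keep.md

# -I {} runs mv once per file and replaces every {} with that file's path.
find . -type f -name '*.txt' -print0 | xargs -0 -I {} -P 4 mv {} {}.log

ls
```

Afterwards both .txt files carry the extra .log suffix, including the one with a space in its name, and keep.md is untouched.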

Parallel

GNU parallel has more features than xargs:

  • Ability to run operations on multiple machines.
  • Display progress percentage.
  • Reporting on operations.
  • Setting time constraints for each operation.
  • Specifying retry attempts for each operation if not successful.
  • Better management of operations.
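A few of these features can be sketched with the --joblog, --timeout and --retries options. The limits chosen here (5 seconds, 2 attempts) and the items a, b, c are arbitrary, and the guard skips the demo when GNU parallel is not installed, since some systems ship a different tool under the same name.

```shell
# Run only if the installed parallel really is GNU parallel.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    log_file=$(mktemp)
    # --timeout 5 : kill any job that runs longer than 5 seconds
    # --retries 2 : try a failing job up to 2 times
    # --joblog    : write one report line per job (runtime, exit value, command)
    results=$(parallel --joblog "$log_file" --timeout 5 --retries 2 \
        echo processed {} ::: a b c | sort)
    echo "$results"
    cat "$log_file"
fi
```

The ::: separator passes the items on the command line instead of stdin. The job log is plain text, so it can be inspected afterwards to find slow or failed jobs.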

Important switches

-0 Specifies that arguments are separated by null characters.
-j or -P Number of parallel processes (default is one per processor core).
-n Maximum number of arguments passed to each command invocation.
--bar or --progress Display progress, either as a bar or as a status line.
--eta Display the estimated time of completion.

Example

find . -type f -name '*.txt' -print0 | parallel -0 -n 1 -P 32 --bar --progress --eta mv {} {}.log
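As with xargs, the rename can be tried in a scratch directory first. This sketch uses invented file names and an arbitrary -j 4, leaves out the progress options to keep the output quiet, and is skipped when GNU parallel is not installed.

```shell
# Run only if the installed parallel really is GNU parallel.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    demo_dir=$(mktemp -d)
    cd "$demo_dir" || exit 1
    touch one.txt 'two words.txt' keep.md

    # {} is parallel's default replacement string, so no -I is needed.
    find . -type f -name '*.txt' -print0 | parallel -0 -j 4 mv {} {}.log

    ls
fi
```

Unlike xargs, parallel substitutes {} without any extra switch and quotes the value for the shell, so the file name with a space is handled correctly.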