Blog from March, 2017

This post outlines how to run a bash function in parallel using xargs. (Note: you could use GNU parallel instead of xargs, but at this stage I see no particular advantage or disadvantage either way.)

It may not be the best way, or the right way, and it may have unforeseen consequences, so I'd welcome any feedback on better practice.

Rationale

We often run scripts with for or while loops. In the simplest case, if the operation inside the loop is self-contained, it is very easy to make it parallel.

E.g.

# written in confluence, might not actually run
for file in *.csv ; do
  cat $file | csv-slow-thin > ${file%.csv}.processed.csv
done

Becomes:

# written in confluence, might not actually run
echo 'file=$1; cat $file | csv-slow-thin > ${file%.csv}.processed.csv' > do-slow-thing-imp
chmod +x do-slow-thing-imp
ls *.csv | xargs -n1 -P8 ./do-slow-thing-imp
rm do-slow-thing-imp

But it's clunky to write a script file like that.

It would be better to use a function, but the specific method in the code block below doesn't work, because xargs can only execute real programs and knows nothing about shell functions:

# written in confluence, might not actually run
function do-slow-thing
{
	file=$1
	cat $file | csv-slow-thin > ${file%.csv}.processed.csv
}
ls *.csv | xargs -n1 -P8 do-slow-thing # but this doesn't work

The following is the current best solution I'm aware of:

Note: set -a could be used to automatically export all subsequently declared vars, but it has caused problems with my bigger scripts

Note: set -a may behave differently on different platforms/bash versions. On Dmitry's machine it exports variables and functions, whereas on James' machine it exports variables only
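A quick way to check what your own bash does is a throwaway probe like the following (not from the original write-up; probe_var and probe_fn are made-up names):

# throwaway probe: probe_var and probe_fn are arbitrary names used only for this test
set -a
probe_var="probe_var was exported"
probe_fn() { echo "probe_fn was exported too"; }
set +a
bash -c 'echo "${probe_var:-probe_var was NOT exported}"; probe_fn 2>/dev/null || echo "probe_fn was NOT exported"'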

Note: declare -f with no arguments dumps all function definitions, so you don't need to work out a priori which nested functions may be called (e.g. errcho in this example)

#!/bin/bash
readonly name=$( basename $0 ); export name
function errcho { (>&2 echo "$name: $1") }

readonly global_var=hello; export global_var
function example_function
{
        passed_var=$1
        errcho "example_function: global var is $global_var and passed var is $passed_var"
}

errcho "first run as a single process"
example_function world

errcho "run parallel with xargs"
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c "$(declare -f) ; example_function {}"

Note: if using comma_path_to_var, you can use --export to export all of the parsed command line options
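As an aside (not part of the solution above), bash can also export the functions themselves with export -f, so that child shells inherit them without splicing in the output of declare -f. A minimal sketch, reusing the function names from the example above:

# sketch only: errcho and example_function are the functions defined in the example above
export -f errcho example_function
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c 'example_function {}'

The variables still need to be exported as before; only the function definitions travel via export -f.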

No need to read beyond this point, unless you want to see the workings that lead up to this, including options that don't work.

 

The problem exposed and the solution

The following code is tested; try it by copying it into a script and running the script.

#!/bin/bash

name=$( basename $0 )
function errcho { (>&2 echo "$name: $1") }

global_var=hello
function example_function
{
        passed_var=$1
        errcho "example_function: global var is $global_var and passed var is $passed_var"
}

errcho "first run as a single process"
example_function world

Single process works fine, output:

xargs_from_func: first run as a single process
xargs_from_func: example_function: global var is hello and passed var is world

Let's try multiple processes with xargs. Add the following lines to the end of the script:

errcho "run parallel with xargs, attempt 1"
(echo oranges; echo apples) | xargs -n1 -P2 example_function

The problem is that example_function is not an executable:

xargs_from_func: run parallel with xargs, attempt 1
xargs: example_functionxargs: example_function: No such file or directory
: No such file or directory

Instead, let's run "bash" which is an executable:

errcho "run parallel with xargs, attempt 2"
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c "example_function {}"

The new bash process doesn't know the function:

xargs_from_func: run parallel with xargs, attempt 2
bash: example_function: command not found
bash: example_function: command not found

So let's declare it:

errcho "run parallel with xargs, attempt 3"
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c "$(declare -f example_function) ; example_function {}"

Getting close, but our example_function refers to another of our functions, which also needs to be declared:

xargs_from_func: run parallel with xargs, attempt 3
bash: line 3: errcho: command not found
bash: line 3: errcho: command not found

We can do that one by one, or declare all our functions in one go:

errcho "run parallel with xargs, attempt 4"
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c "$(declare -f example_function) ; $(declare -f errcho) ; example_function {}"

errcho "run parallel with xargs, attempt 5"
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c "$(declare -f) ; example_function {}"

The function itself now works, but all the global variables are lost (including "global_var" and also the script name):

xargs_from_func: run parallel with xargs, attempt 4
: example_function: global var is  and passed var is oranges
: example_function: global var is  and passed var is apples
xargs_from_func: run parallel with xargs, attempt 5
: example_function: global var is  and passed var is oranges
: example_function: global var is  and passed var is apples

We can add these explicitly, one by one, e.g.:

errcho "run parallel with xargs, attempt 6"
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c "$(declare -f) ; global_var=$global_var ; example_function {}"

...but it's extremely hard to work out which functions call which other functions, and which global variables each of those functions uses. This leads to very hard-to-trace bugs in real-world examples.

xargs_from_func: run parallel with xargs, attempt 6
: example_function: global var is hello and passed var is oranges
: example_function: global var is hello and passed var is apples

Another option is to pass everything through in one go by using "set":

errcho "run parallel with xargs, attempt 7"
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c "$(set) ; example_function {}"

This spits out a lot of extra noise, because the output of set includes attempts to reassign readonly variables:

xargs_from_func: run parallel with xargs, attempt 7
bash: line 1: BASHOPTS: readonly variable
bash: line 1: BASHOPTS: readonly variable
bash: line 8: BASH_VERSINFO: readonly variable
bash: line 8: BASH_VERSINFO: readonly variable
bash: line 38: EUID: readonly variable
bash: line 38: EUID: readonly variable
bash: line 68: PPID: readonly variable
bash: line 79: SHELLOPTS: readonly variable
bash: line 87: UID: readonly variable
bash: line 68: PPID: readonly variable
bash: line 79: SHELLOPTS: readonly variable
xargs_from_func: example_function: global var is hello and passed var is oranges
bash: line 87: UID: readonly variable
xargs_from_func: example_function: global var is hello and passed var is apples

...but notice that it did work.

For some reason the following doesn't hide the readonly errors:

errcho "run parallel with xargs, attempt 7"
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c "$(set) > /dev/null ; example_function {}"

...and I've tried various combinations of putting the /dev/null inside the $(), and of redirecting stderr.

I think, therefore, the best approach is to explicitly declare each global using export, and to either explicitly export each function, or use the "declare -f" statement at the xargs call.

That looks like this:

#!/bin/bash
readonly name=$( basename $0 ); export name
function errcho { (>&2 echo "$name: $1") }

readonly global_var=hello; export global_var
function example_function
{
        passed_var=$1
        errcho "example_function: global var is $global_var and passed var is $passed_var"
}

errcho "first run as a single process"
example_function world

errcho "run parallel with xargs"
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c "$(declare -f) ; example_function {}"

Note: the readonly is not strictly necessary for this example, but it is good practice if the variable really is read-only.

The whole script together:

#!/bin/bash

name=$( basename $0 )
function errcho { (>&2 echo "$name: $1") }

global_var=hello
function example_function
{
        passed_var=$1
        errcho "example_function: global var is $global_var and passed var is $passed_var"
}

errcho "first run as a single process"
example_function world


errcho "run parallel with xargs, attempt 1"
(echo oranges; echo apples) | xargs -n1 -P2 example_function

errcho "run parallel with xargs, attempt 2"
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c "example_function {}"

errcho "run parallel with xargs, attempt 3"
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c "$(declare -f example_function) ; example_function {}"

errcho "run parallel with xargs, attempt 4"
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c "$(declare -f example_function) ; $(declare -f errcho) ; example_function {}"

errcho "run parallel with xargs, attempt 5"
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c "$(declare -f) ; example_function {}"

errcho "run parallel with xargs, attempt 6"
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c "$(declare -f) ; global_var=$global_var ; example_function {}"

errcho "run parallel with xargs, attempt 7"
(echo oranges; echo apples) | xargs -n1 -P2 -i bash -c "$(set) ; example_function {}"



Some external references:

http://stackoverflow.com/questions/1305237/how-to-list-variables-declared-in-script-in-bash

http://stackoverflow.com/questions/11003418/calling-functions-with-xargs-within-a-bash-scrip

Flush when you (sig)pipe

This blog entry describes a pretty subtle bug that leads to unexpected behaviour when handling the PIPE signal in C++.

First, a brief reminder of how and when the PIPE signal is used. Assume we have a pipeline of commands:

pipeline
command | head -n 2

The commands generate and process textual output, but we take only the first 2 lines. Once head has received two lines of output, it terminates. On the next write, the standard output of command has no recipient, so the operating system sends a PIPE signal to command, which terminates it. The point to note is that the signal is sent only when a command actually attempts to write something to standard output. As long as the command stays silent, no signal is sent and the command keeps running, e.g.

no output, no signal
time { sleep 10 | sleep 1; }

runs for 10 seconds, even if the "recipient" of the output exits after only 1 second.
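Conversely, a writer that does produce output is killed almost as soon as its reader goes away, e.g. (a small contrasting sketch, not from the original post):

output, so signal
time { yes | sleep 1; }

runs for only about 1 second: yes writes constantly, so once sleep exits the next write raises SIGPIPE and yes is killed.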

Puzzle

With this pattern in mind, consider the following snippet of code:

Loop with signal handling
std::string line;
line.reserve( 4000 );
try
{
    signal_flag is_shutdown;
    command_line_options options( ac, av, usage );
    char delimiter = options.value( "--delimiter", ',' );
    bool flush = options.exists( "--flush" );
    comma::csv::format format( av[1] );
    while( std::cin.good() && !std::cin.eof() )
    {
        if( is_shutdown ) { std::cerr << "csv-to-bin: interrupted by signal" << std::endl; return -1; }
        std::getline( std::cin, line );
        if( !line.empty() && *line.rbegin() == '\r' ) { line = line.substr( 0, line.length() - 1 ); } // windows... sigh...
        if( !line.empty() ) { format.csv_to_bin( std::cout, line, delimiter, flush ); }
    }
    return 0;
}

The code is copied from the csv-to-bin utility at git revision c2521b3d83ee5f77cb1edf3fe7d42b767b4a392b. The exact details of the signal_flag class are not relevant; it suffices to say that, on receipt of INT, TERM, and PIPE signals, it evaluates to logical "true", and execution then continues normally from the point where the signal was received. If you want to follow the problem hands-on, check out the code with git checkout c2521b3. To return the code to the current (HEAD) revision, run git checkout master.
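For readers unfamiliar with the pattern, a rough bash analogue of such a flag (a sketch only, not the actual comma implementation) would be:

# a trap records the signal in a variable; the main loop keeps running
# and decides for itself when to act on the flag
is_shutdown=""
trap 'is_shutdown=1' INT TERM PIPE
while read -r line; do
    [[ -n "$is_shutdown" ]] && { echo "interrupted by signal" >&2; exit 1; }
    # ... process "$line" here ...
done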

Now consider the following script using csv-to-bin:

script
#!/bin/bash

for n in {0..9}; do
    sleep 2
    echo "$0: output $n" >&2
    echo ">>>",$n | csv-to-bin s[3],ui || { echo "output failed, $?" >&2; exit 1; }
done

Let us invoke the script in the following pattern:

usage pattern
./count-bin.sh | csv-from-bin s[3],ui --delimiter=' ' | head -n 2

The expected sequence of events is:

  1. initially we see lines "./count-bin.sh: output 0" from the script itself (on the standard error) and ">>> 0" from csv-from-bin on standard output
  2. after two iterations (two lines on standard output), head terminates
  3. when csv-from-bin attempts to write its output on the next iteration (counter n is 2), the pipe is closed and there is no recipient; therefore, csv-from-bin receives a PIPE signal and terminates; we shall see the output from the script itself (on standard error) but no line ">>> 2" on standard output
  4. finally, on the next iteration (counter n is 3) there is no recipient for the output of csv-to-bin either; therefore, csv-to-bin shall receive a PIPE signal and terminate with the "interrupted by signal" message, the script shall print the "output failed" message and exit

So far so good. The actual output, however, is:

wrong output
./count-bin.sh: output 0
>>> 0
./count-bin.sh: output 1
>>> 1
./count-bin.sh: output 2
./count-bin.sh: output 3
./count-bin.sh: output 4
./count-bin.sh: output 5
./count-bin.sh: output 6
./count-bin.sh: output 7
./count-bin.sh: output 8
./count-bin.sh: output 9

The script keeps running and csv-to-bin apparently never receives SIGPIPE, even though the head and csv-from-bin processes are gone (this can be confirmed by looking at the process tree from a separate terminal).

So, what went wrong?

Explanation

The standard output is buffered by default. Therefore, no actual write is made in the main loop of csv-to-bin (unless the '--flush' option is used or the buffer fills up, which does not happen in our example). Since nothing is written to standard output within the loop itself, no signal is sent.

Once all the input is processed, the main loop terminates and execution proceeds to the "return 0" line. Again, nothing has been written yet and no signal is sent.

Finally, the main function exits. At this point the C++ runtime invokes the destructors of all the global objects, including the output streams, and only then is the output actually written. This is when csv-to-bin discovers that its output has no recipient and gets a PIPE signal. However, by this time we are well past the code in main: the signal is received, but nothing can act on it any more. To the end user it looks like csv-to-bin receives the signal and ignores it, exiting with status 0, which was already set by "return 0" before the signal arrived.

From the point of view of the count-bin.sh script, the csv-to-bin call was a success, and therefore the script keeps running, contrary to what we expected to achieve by using "head -n 2".

Solution

Depending on your requirements, any of the following approaches can be used:

Do not handle PIPE signal

This is the simplest way, and it has been implemented in the current version of csv-to-bin and other comma applications. If no user handler is set for SIGPIPE, the default behaviour applies: on receipt of SIGPIPE the program is terminated, and the shell reports exit status 141 (128 + 13, the SIGPIPE signal number). Unless the program must do something really special on receiving the signal, e.g. write a log file, sync a database, and so on, there is no need to handle PIPE (or any other signal, for that matter) explicitly.
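The 141 is easy to observe from the shell; a quick sketch using bash's PIPESTATUS array (nothing comma-specific here):

# yes is killed by SIGPIPE once head has taken its single line;
# PIPESTATUS holds the exit status of every command in the last pipeline
yes | head -n 1
echo "${PIPESTATUS[@]}"    # prints "141 0": 141 = 128 + 13 (SIGPIPE), 0 from head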

Flush after yourself

Nuff said. If you do need to handle SIGPIPE, make sure that every output is flushed (or not buffered in the first place). The flush will trigger a PIPE signal if no-one reads your output. Note that performance may be badly affected by this approach.
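In the example above, the pre-fix csv-to-bin already has such an option (see the options.exists( "--flush" ) line in the snippet), so a sketch of a workaround is to change the csv-to-bin call in count-bin.sh to:

# flushing every record moves the write, and hence the PIPE signal,
# back inside the main loop where the signal_flag check can see it
echo ">>>",$n | csv-to-bin s[3],ui --flush || { echo "output failed, $?" >&2; exit 1; }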

Kill yourself

Change the signal handler to perform the necessary last-minute actions on receiving SIGPIPE, then restore the default handler and re-raise the signal. In this case, the utility will also terminate with exit status 141.
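The same idea expressed as a bash sketch (not comma code): do the last-minute work in the trap, restore the default disposition, then re-raise, so that the caller still sees exit status 141:

# last-minute work, then restore the default handler and re-raise the signal
trap 'echo "cleaning up" >&2; trap - PIPE; kill -s PIPE $$' PIPE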

Restore the default signal handler

The custom signal handler is installed in the constructor of the signal_flag object. Once the object goes out of scope, it shall restore the default handler. This shall be the default implementation, but it has not been done yet. This approach is more appropriate for longer-running applications that must handle signals only during some special sections of the code. Once out of the special section, the default handler shall apply again. The special handler shall perform the necessary last-minute actions and then re-raise the signal.