The chdb tutorial

How to use this tutorial?

Each entry point of the tutorial is dedicated to a use case, from the most general and simple to the most specific and complicated. To understand what happens when chdb is launched on a real problem, we'll use some very simple bash commands. You should:

  • carefully read the few lines before and after each code sample, to understand the point.
  • copy and paste each code sample provided here, execute it from the front node, and look at the created files using standard unix tools such as find, ls, less, etc.

Of course, in real life you'll have to replace the toy command line launched by chdb with your own code, and you must launch chdb through an sbatch command, as usual. Please remember that you cannot use the front nodes to launch real applications: the front nodes are dedicated to file editing or test runs.

Also, in real life you should use srun to launch the chdb program, not mpirun. In this tutorial, use mpirun to stay on the front node, and srun to run on the compute nodes.
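For illustration, here is a minimal sketch of such an sbatch script; the resource values, the time limit and the chdb options are placeholders to adapt to your own case, only the module line is the one used in this tutorial:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=37
#SBATCH --time=01:00:00

# Initialize the environment used throughout this tutorial
module load intelmpi chdb/1.0

# Launch chdb on the compute node with srun: 1 master + 36 slaves
srun chdb --in-dir inp --in-type txt \
 --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out'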

Introduction

Before starting...

You should be connected to Olympe before working on this tutorial. You must initialize your environment to be able to use the commands described here:

module load intelmpi chdb/1.0

You should work in an empty temporary directory, to be sure you will not remove important files while doing the exercises described in the following paragraphs:

mkdir TUTO-CHDB
cd TUTO-CHDB

Help!

You can ask chdb for help:

chdb --help

Essential


Executing the same code on several input files

The goal of chdb is to execute the same code on a series of input files, and store the results produced by those instances.
Using chdb implies that the executions of your code are independent and that the order in which the files are treated is irrelevant.
In order to use chdb, you must put all your input files inside a common "input directory".
Let's create 10 files under a directory called inp:

rm -r inp inp.out;
mkdir inp;
for i in $(seq 1 10); do echo "coucou $i" > inp/$i.txt; done

Then let's use those 10 files as input for a little toy program. We specify --in-type txt to tell chdb to take all *.txt files as input:

mpirun -n 2 chdb --in-dir inp \
 --in-type txt \
 --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out'
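To make the templates clearer, here is what the command above should expand to for the input file 3.txt, assuming the default output directory name inp.out:

cat inp/3.txt >inp.out/3.out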

A new directory called inp.out has been created; it contains all the output files. If you don't believe me, just try these commands:

find inp.out -ls
find inp.out -type f |xargs cat

Run the toy program again: it does not work, and you get the following message:

ERROR - Cannot create directory inp.out - Error= File exists
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

chdb creates the output directory itself, and refuses to run if the output directory already exists: thus you are protected from destroying previous computation results by mistake.
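If you really want to run the same computation again, remove the old output directory first (or choose another output directory with --out-dir, as shown below):

rm -r inp.out
mpirun -n 2 chdb --in-dir inp --in-type txt \
 --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out'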

How many processes for a chdb job?

chdb is an mpi application running in master-slave mode. This means that the rank 0 process is the master: it distributes the tasks to the other processes, called "slaves", which are responsible for launching your code. Thus if you request n processes for your chdb job, you will have n-1 slaves running.
However, the master does not take much cpu time, as it only sends some work to the slaves and retrieves their results. So on a machine with 36-core nodes, it makes sense to request 37 processes. If you start chdb on 2 nodes, you can request 73 processes; they should be correctly placed on the cores.
Please note that the number of running slaves should be less than or equal to the number of input files. For instance, the following will not work (2 input files and 3 slaves):

rm -r inp inp.out;
mkdir inp;
for i in $(seq 1 2); do echo "coucou $i" > inp/$i.txt; done
mpirun -n  4 chdb --in-dir inp --in-type txt \
--command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out'

Try again, replacing -n 4 with -n 3.
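A sketch of the corrected command, with 2 slaves for the 2 input files (remove inp.out first if the failed run already created it):

rm -r inp.out 2>/dev/null
mpirun -n 3 chdb --in-dir inp --in-type txt \
 --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out'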

Specifying the output directory

If you do not provide chdb with any output specification, chdb will build an output directory name for you. But of course it is possible to change this behaviour. This is particularly useful when using chdb through slurm (which is the standard way of working), as you can integrate the job ID into the output directory name:

salloc -N 1 -n 36 --time=2:0;
mpirun -n 2 chdb --in-dir inp --in-type txt \
 --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out' \
 --out-dir $SLURM_JOB_ID
exit

Executing your code on a subset of files

Maybe you'll want to execute the program only on a subset of the files found in the input directory:

Create 10 files in an input directory:

rm -r inp inp.out
mkdir inp
for i in $(seq 1 10); do echo "coucou $i" > inp/$i.txt; done

Create a file containing the list of files to be processed. This will be our subset:

for i in $(seq 1 2 10); do echo $i.txt >> input_files.txt; done
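You can display the subset; it should contain the odd-numbered files 1.txt, 3.txt, 5.txt, 7.txt and 9.txt:

cat input_files.txt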

Execute chdb using only the subset as input:

mpirun -n 2 chdb --in-dir inp --in-type txt \
 --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out' \
 --in-files input_files.txt;

Executing your code on a hierarchy of files

The h in chdb stands for hierarchy: the input files do not have to be in a "flat" directory, they can be organized in a hierarchy, which is sometimes more convenient if you have a lot of files. The following commands will create a hierarchy in the input directory, execute the toy program on the .txt files, then recreate the same hierarchy in the output directory and write the output files there:

rm -r inp inp.out
mkdir -p inp/odd inp/even
for i in $(seq 1 2 9); do echo "coucou $i" > inp/odd/$i.txt;done
for i in $(seq 0 2 8); do echo "coucou $i" > inp/even/$i.txt; done
mpirun -n 2 chdb --in-dir inp --in-type txt \
 --command-line 'cat %in-dir%/%path% >%out-dir%/%dirname%/%basename%.out' \
 --out-files %out-dir%/%dirname%/%basename%.out

Please note the switch --out-files is required here, as it is the only way for chdb to know which subdirectories must be created to recreate the hierarchy in the output directory (you may just try to run the command above without specifying --out-files and see what happens).
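You can check that the odd and even subdirectories were recreated under inp.out, with the output files inside them:

find inp.out -type f | sort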

Managing the errors

The default behaviour of chdb is to stop as soon as an error is encountered, as shown here: we create 10 files, artificially provoke an error on two of them, and run chdb:

rm -r inp inp.out;
mkdir inp;
for i in $(seq 1 10); do echo "coucou $i" > inp/$i.txt; done
chmod u-r inp/2.txt  inp/3.txt
mpirun -n 2 chdb --in-dir inp --in-type txt \
 --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out \
 2>%out-dir%/%basename%.err'
find inp.out

chdb writes an error message on standard error and stops working: indeed, only a few output files are created. We know that the problem arose with the file 2.txt, so it is easy to investigate:

ERROR with external command - sts=1 input file=2.txt
ABORTING - Try again with --on-error
find inp.out
cat inp.out/2.err

Thanks to this behaviour, you'll avoid wasting a lot of cpu time just because you had an error in some parameter. However, please note that in this exercise the file 3.txt has the same error, but as chdb stopped before processing this file, you are not aware of this situation. You can modify this behaviour:

rm -r inp inp.out;
mkdir inp;
for i in $(seq 1 10); do echo "coucou $i" > inp/$i.txt; done
chmod u-r inp/2.txt inp/3.txt
mpirun -n 2 chdb --in-dir inp --in-type txt \
 --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out \
 2>%out-dir%/%basename%.err' \
 --on-error errors.txt
find inp.out

Now all files are processed, and a file called errors.txt is created: the files for which your program returned a code different from 0 (thus considered as errors) are listed in errors.txt. It is thus simple to know exactly which files were not processed, and for what reason (if the error code returned by your code is meaningful). When the problem is corrected, you'll be able to run chdb again using only those files as input.
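You can have a look at errors.txt; it should mention 2.txt and 3.txt (depending on the chdb version, each line may also carry the returned status):

cat errors.txt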

For clarity we rename errors.txt to input_files.txt, but this is not required:

chmod u+r inp/2.txt  inp/3.txt
mv errors.txt input_files.txt
mpirun -n 2 chdb --in-dir inp --in-type txt \
 --command-line 'cat %in-dir%/%path% >%out-dir%/%basename%.out \
 2>%out-dir%/%basename%.err' \
 --on-error errors-1.txt \
 --out-dir inp.out-1 \
 --in-files input_files.txt

Advanced


Generating an execution report

You may ask for an execution report with the switch --report:

rm -r inp inp.out;
mkdir inp;
for i in $(seq 1 10); do echo "coucou $i" > inp/$i.txt; done
mpirun -n 3 chdb --in-dir inp --in-type txt \
 --command-line 'sleep 1; cat %in-dir%/%path% >%out-dir%/%basename%.out \
 2>%out-dir%/%basename%.err' \
 --report report.txt

The report gives you some information about:

  • The processing time of every input file
  • The computation time of every slave
  • Some global data

This information is useful to check and improve the load balancing, as explained below:

Improving the load balancing (1/2)

Load balancing is an important aspect to be aware of, and maybe to improve. You can get some information about it using the --report switch, as explained above.
If the work is well balanced, all the slaves finish their job at the same time, and no cpu cycles are wasted. Conversely, if some slaves take more time to complete than others, the faster ones have to wait for the latecomers, and you can lose a lot of cpu time.
In the following example, we have 10 files to process, and we launch chdb with 4 slaves. Each treatment takes about 1 s:

rm -r inp inp.out;
mkdir inp;
for i in $(seq 1 10); do echo "coucou $i" > inp/$i.txt; done
mpirun -n 5 chdb --in-dir inp --in-type txt \
 --command-line 'sleep 1; cat %in-dir%/%path% >%out-dir%/%basename%.out \
 2>%out-dir%/%basename%.err' \
 --report report.txt
cat report.txt

Here is the report file:

SLAVE    TIME(s)    STATUS    INPUT PATH
2    1.00965    0    2.txt
1    1.01282    0    1.txt
3    1.01392    0    10.txt
4    1.01529    0    3.txt
2    1.00846    0    4.txt
1    1.00863    0    5.txt
3    1.01006    0    6.txt
4    1.00955    0    7.txt
2    1.0082     0    8.txt
1    1.00727    0    9.txt
----------------------------------------------
SLAVE    N INP    CUMULATED TIME (s)
1    3    3.02872
2    3    3.02631
3    2    2.02398
4    2    2.02484
----------------------------------------------
AVERAGE TIME (s)        = 2.52596
STANDARD DEVIATION (s)  = 0.579144
MIN VALUE (s)           = 2.02398
MAX VALUE (s)           = 3.02872

It is easy to see that 2 slaves work for 3 seconds each, and 2 slaves work for 2 seconds each. The two faster slaves have to wait for the two slower ones. But you reserved 4 processors, so you'll have to "pay" for 4x3=12 seconds, for only 10 useful seconds: about 17% of the cpu time is wasted. In this very simple example, you can easily correct the problem by using 5 slaves instead of 4:

rm -r inp.out
mpirun -n 6 chdb --in-dir inp --in-type txt --command-line 'sleep 1; cat %in-dir%/%path% >%out-dir%/%basename%.out 2>%out-dir%/%basename%.err' --report report.txt
cat report.txt

The report.txt file shows that this time you will have to "pay" for only 5x2=10 seconds.

SLAVE    TIME(s)    STATUS    INPUT PATH
2    1.01017    0    2.txt
3    1.01334    0    10.txt
1    1.01385    0    1.txt
5    1.01459    0    4.txt
4    1.01562    0    3.txt
2    1.00696    0    5.txt
1    1.0086     0    7.txt
3    1.01       0    6.txt
5    1.01138    0    8.txt
4    1.01139    0    9.txt
----------------------------------------------
SLAVE    N INP    CUMULATED TIME (s)
1    2    2.02245
2    2    2.01713
3    2    2.02334
4    2    2.02701
5    2    2.02597
----------------------------------------------
AVERAGE TIME (s)        = 2.02318
STANDARD DEVIATION (s)  = 0.00386248
MIN VALUE (s)           = 2.01713
MAX VALUE (s)           = 2.02701

Improving the load balancing (2/2)

In the following example, we create files of different sizes, then we arrange for the toy code to take more time on bigger files:

rm -r inp inp.out
mkdir inp
X='X'; for i in $(seq 1 9); do echo "coucou $i" > inp/$i.txt; X="$X$X"; for j in $(seq 1 $i); do echo $X >> inp/$i.txt; done; done
mpirun -n 3 chdb --in-dir inp --in-type txt \
 --command-line \
 'sleep $(( $(cat %in-dir%/%path%|wc -c) / 100 + 1)); cat %in-dir%/%path% >%out-dir%/%basename%.out 2>%out-dir%/%basename%.err' \
 --report report.txt

The report shows that the situation is rather catastrophic in terms of load balancing:

SLAVE    TIME(s)    STATUS    INPUT PATH
1    2.01111    0    1.txt
2    2.01391    0    2.txt
1    2.01003    0    3.txt
2    2.01012    0    4.txt
1    3.01058    0    5.txt
2    5.00927    0    6.txt
1    11.0105    0    7.txt
2    22.01      0    8.txt
1    48.0117    0    9.txt
----------------------------------------------
SLAVE    N INP    CUMULATED TIME (s)
1    5    66.054
2    4    31.0433
----------------------------------------------
AVERAGE TIME (s)        = 48.5486
STANDARD DEVIATION (s)  = 24.7562
MIN VALUE (s)           = 31.0433
MAX VALUE (s)           = 66.054

Files 8 and 9 are processed at the end, because by default the files are distributed to the slaves in alphabetical order. Unfortunately, the last file needs a rather long time to be processed, leading to bad load balancing.
A cleverer scenario consists of starting with the longest jobs: the file nb 9 is then processed first (by slave nb 1), leaving the other files to the other slave, hence a much better load balancing:

rm -r inp.out;
mpirun -n 3 chdb --in-dir inp --in-type txt \
 --command-line \
 'sleep $(( $(cat %in-dir%/%path%|wc -c) / 100 )); \
 cat %in-dir%/%path% >%out-dir%/%basename%.out 2>%out-dir%/%basename%.err' \
 --report report.txt \
 --sort-by-size

The load balancing is dramatically improved:

SLAVE    TIME(s)    STATUS    INPUT PATH
2    22.0131    0    8.txt
2    11.01      0    7.txt
2    5.00993    0    6.txt
2    3.0096     0    5.txt
2    2.00942    0    4.txt
2    2.00944    0    3.txt
2    2.01015    0    2.txt
1    48.0111    0    9.txt
2    2.0096     0    1.txt
----------------------------------------------
SLAVE    N INP    CUMULATED TIME (s)
1    1    48.0111
2    8    49.0813
----------------------------------------------
AVERAGE TIME (s)        = 48.5462
STANDARD DEVIATION (s)  = 0.756772
MIN VALUE (s)           = 48.0111
MAX VALUE (s)           = 49.0813

NOTE - Many codes, but not all, take longer to process bigger files than smaller ones. If this is true of your code, you can try using the --sort-by-size switch; if it is not, this switch will be useless.

Avoiding input-output saturation

While having different run times for the many code instances you launch may cause a performance issue (see above), launching a code which always takes the same time may also be a problem. Real codes frequently need to read or write huge data files, generally at the beginning and at the end of the job (and sometimes during the job, between iterations). If you launch 10 such codes simultaneously with chdb, the file system may rapidly become completely saturated, because all the instances will try to read or write data at the same time. chdb allows you to desynchronize them, as shown below:

rm -r inp inp.out;
mkdir inp;
for i in $(seq 1 10); do echo "coucou $i" > inp/$i.txt; done
mpirun -n 6 chdb --in-dir inp --in-type txt \
 --command-line \
 'cat %in-dir%/%path% >%out-dir%/%basename%.out' \
 --sleep 2

The switch --sleep n causes each slave to wait n * rank seconds before its first run, where rank is the mpi rank (the slave number). With this trick, the slaves should be desynchronized: if every slave reads or writes files at the same time relative to the start of its run, the shift in start times ensures that the I/O operations are not synchronized.

Warning! If you work with, say, 100 slaves, the slave nb 99 will have to wait 2x99=198 seconds before starting, which can be a problem unless the total execution time is much longer than that. We are still working on improving this point.

Launching an mpi program

The code you want to launch through chdb does not have to be sequential: it can be an mpi-based program. In this case you must tell chdb about it.
To go on with the tutorial, please download this toy program. This code:

  • Prints the chdb slave number and the mpi rank number
  • Is linked to the numa library, to print the cores that could be used by each mpi process
    (if you do not understand the C code of hello-mpi.c, this is not a problem, it is not needed to go on with this tutorial).

We rename the downloaded file and compile the program:

mv hello-mpi.c.txt hello-mpi.c
mpicc -o hello-mpi hello-mpi.c -lnuma

Now, let's run our mpi code through chdb: for this exercise we use srun, in order to leave the front node and work on a compute node. We have 18 input files, and each slave needs 2 mpi processes to run. It thus makes sense to reserve a complete node, starting 19 chdb processes (1 master + 18 slaves):

rm -r inp inp.out;
mkdir inp;
for i in $(seq 1 18); do echo "coucou $i" > inp/$i.txt; done
srun -N 1 -n 19 chdb --in-dir inp --in-type txt \
 --command-line  './hello-mpi %name%' \
 --mpi-slaves 2

The switch --mpi-slaves tells chdb that the external code is an mpi program, and how many mpi processes to launch for each slave. Here is the output:

[output screenshot: chdb - mpi]

Controlling the placement of an mpi code

If you look at the output above, you'll see that while the mpi processes belonging to the same slave do not overlap (they use different core sets), there is a huge overlap between different chdb slaves. This can be avoided by slightly modifying the chdb command.

Here we decided to start a run with the following characteristics:

  1. 18 slaves running on only one node
  2. Each slave is an mpi program using 2 processes
  3. Each process uses only 1 thread (no openmp)

rm -r inp.out
srun -N 1 -n 19 chdb --in-dir inp --in-type txt \
 --command-line  './hello-mpi %name%' \
 --mpi-slaves 18:2:1

The important switch is --mpi-slaves 18:2:1, telling chdb that:

  • We are running 18 slaves per node
  • Each slave is an mpi code using 2 mpi processes
  • Each mpi process uses 1 thread

Here is the output; with 18 slaves x 2 mpi processes x 1 thread, all 36 cores of the node are used, and this time there is no overlap between slaves:

[output screenshot: chdb - mpi code correctly placed]

Another use case: let's launch a run using 2 nodes, 4 slaves per node (thus 2x4 + 1 = 9 chdb tasks), 4 mpi processes per slave and 2 threads per mpi process. The command becomes:

srun -N 2 -n 9  chdb ... --mpi-slaves 4:4:2
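For illustration only, reusing the toy input files and the hello-mpi program from the previous exercise (adapt the options to your real code), the complete command might look like this:

srun -N 2 -n 9 chdb --in-dir inp --in-type txt \
 --command-line './hello-mpi %name%' \
 --mpi-slaves 4:4:2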

Miscellaneous


Working with directories as input

Many scientific codes prefer to work inside a current directory, reading and writing files in this directory, sometimes hardcoding file names.

This is possible with chdb: you just have to use dir as the file type (consequently dir is a reserved file type, and you should not use it for ordinary files). Of course, your code may or may not be an mpi code.

Here is a sample:

rm -r inpd; for d in $(seq 1 10); do mkdir -p inpd/$d.dir; echo "coucou $d" > inpd/$d.dir/input; done
mpirun -n 11 chdb  --in-dir inpd --in-type dir --command-line 'cat input >output'
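Since the toy command uses relative file names, each instance works inside its own input subdirectory, and the output files should appear next to the input files. You can check it with:

find inpd -name output | xargs cat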

The behaviour is somewhat different now:

  • There is no default output directory, because we now work in the input directory, which is also the output directory.
  • However, you can specify an output directory if this is convenient for you.
  • While in the previous chdb runs the input directory could have been read-only, this is no longer the case.

Executing a code a predefined number of times

chdb can be used to execute a code a predefined number of times, without specifying any file or directory as input:

rm -r iter.out
mpirun -n 3 chdb --out-dir iter.out \
 --command-line 'echo %path% >%out-dir%/%path%' \
 --in-type "1 10"

It is not required to specify an output directory, as shown here:

rm -r iter.out
mpirun -n 3 chdb \
 --in-type "1 20 2" \
 --command-line 'echo slave nb $CHDB_RANK iteration nb %path%'

The environment variables and the file specification templates are still available in this mode of operation.

Checkpointing chdb

WARNING - The feature described here is still considered EXPERIMENTAL - Feedback is welcome.
If chdb receives a SIGTERM or a SIGKILL signal (for instance if you scancel your running job), the signal is intercepted, and the input files or iterations not yet processed are saved to a file called CHDB-INTERRUPTION.txt. The files being processed when the signal is received are considered as "not processed", because their processing is interrupted.


It is then easy to launch chdb again, using CHDB-INTERRUPTION.txt as input.

Let's try it using an iteration in-type for more simplicity:

rm -rf iter.out iter1.out
mpirun -n 3 chdb --out-dir iter.out  --in-type "1 100" \
 --command-line 'sleep 1;echo %path% >%out-dir%/%path%'

Execute the above commands, wait about 10 seconds, then type CTRL-C. The program is interrupted and CHDB-INTERRUPTION.txt is created.

First have a look at CHDB-INTERRUPTION.txt and check the last line:

# Number of files not yet processed = 56

If this line is not present, CHDB-INTERRUPTION.txt is not complete and it can be difficult to resume the execution of your job.

If CHDB-INTERRUPTION.txt is complete, we can launch chdb again using the --in-files switch, and this time we wait until the end:

mpirun -n 3 chdb --out-dir iter1.out  --in-type "1 100" \
 --command-line 'sleep 1;echo %path% >%out-dir%/%path%' \
 --in-files CHDB-INTERRUPTION.txt

You can now check that every file was correctly created; the following command should display 100:

find iter.out iter1.out -type f| wc -l

Controlling the code environment (v 1.1.x)

STILL BUILDING - DOES NOT WORK YET!

WARNING - To use this feature, please initialize the chdb environment with:

module load chdb/1.1.0

Starting with the 1.1.0 release, there is a better isolation between an mpi code executed by the chdb slaves and the chdb code itself. This is an important feature, because at Calmip we provide several flavours of the mpi library: different versions of intelmpi and different versions of openmpi.

You can completely control your program environment, regardless of the chdb environment. This paragraph explains how.

Controlling the mpi flavour:

export CHDB_MPI=openmpi

or

export CHDB_MPI=intelmpi

Declaring modules:

Suppose your code needs the modules: openmpi/gnu/2.0.2.10, gcc/5.4.0, and hdf5/1.10.2-openmpi:

export CHDB_MODULES="openmpi/gnu/2.0.2.10 gcc/5.4.0 hdf5/1.10.2-openmpi"

(the separator is a space character)

Defining environment variables:

You can define environment variables for your code in two ways:

Giving only the name of the variables:

In this case, the environment variables must already be defined; you just have to list their names:

export CHDB_ENVIRONMENT="OMP_NUM_THREADS OMP_PROC_BIND"

Defining the variable name and value:

In this case, the variable does not need to be already defined; you just give name=value pairs in one shot:

export CHDB_EXPORT_LIST="MY_VAR1=4 MY_VAR2=5"

Declaring code snippets:

Maybe you have to execute some unix commands before running your code. It is possible to introduce bash fragments:

export CHDB_SNIPPET='ulimit -s 10240; source my_code_init.bash'

With those environment variables, chdb is able to "write" a small script, containing instructions executed just before running your code. The environment used by chdb itself (i.e. gcc 10.3.0 and intelmpi) is completely forgotten when your code is running.
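Putting it all together, here is a sketch of a complete setup; the mpi flavour, the module list, the variables and the my_mpi_code program are only placeholders for your own code and environment:

export CHDB_MPI=openmpi
export CHDB_MODULES="openmpi/gnu/2.0.2.10 gcc/5.4.0 hdf5/1.10.2-openmpi"
export CHDB_EXPORT_LIST="MY_VAR1=4 MY_VAR2=5"
export CHDB_SNIPPET='ulimit -s 10240'

# chdb is launched as usual, the slaves run my_mpi_code in the environment described above
srun -N 1 -n 19 chdb --in-dir inp --in-type txt \
 --command-line './my_mpi_code %in-dir%/%path%' \
 --mpi-slaves 2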


See also

"Embarrassingly parallel" computing: non-mpi codes

This article explains how to execute a single program on a set of input files.

"Embarrassingly parallel" computing: mpi codes

Launching mpi treatments with chdb.

Hybrid MPI and OpenMP execution

Through examples, we show how to run mixed MPI+OpenMP jobs, explicitly binding the processes and threads to the physical cores of the nodes.

Submitting jobs with dependencies

Jobs can be submitted so that they start only after other jobs have finished.