Discussion:
random pool entropy starvation in initrd with parse-kickstart
Brian C. Lane
2015-11-23 23:08:20 UTC
Trying to write down some of the things I know so that I don't forget
and so that maybe someone else can figure this out.

I've been testing with a PXE setup using rawhide boot.iso's that I've
created. I'm injecting the modified code into the initrd by adding
,/path/to/extra.img to the initrd PXE entry. This also works for adding
the kickstart at /

If parse-kickstart is run before the kernel initializes the nonblocking
random pool it appears to starve the system and it takes anywhere from
several minutes to forever for the pool to be initialized and for the
boot to continue.

There doesn't seem to be much of anything feeding the entropy pool in
the initrd. https://github.com/rhinstaller/anaconda/pull/458 waits for
some number of bits before continuing, printing what's available as it
waits. When running parse-kickstart early the pool sits at 0 pretty much
forever. When running it later it starts around 50, and pool initialization
seems to happen at around 300 (even though the code in the kernel's
random.c is checking for 128).
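
For anyone who wants to watch the pool themselves, here is a minimal sketch of that kind of wait loop in Python. It is not the code from the pull request above. The procfs path is the standard Linux one, but the function names, the threshold and the polling interval are all illustrative assumptions.

```python
import time

# Standard Linux procfs file holding the kernel's entropy estimate, in bits.
ENTROPY_AVAIL = "/proc/sys/kernel/random/entropy_avail"


def read_entropy(path=ENTROPY_AVAIL):
    """Return the current entropy estimate (in bits) read from path."""
    with open(path) as f:
        return int(f.read().strip())


def wait_for_entropy(threshold, path=ENTROPY_AVAIL, interval=1.0,
                     timeout=None, report=print):
    """Poll until the estimate reaches threshold.

    Reports the current value on every poll. Returns True once the
    threshold is reached, or False if timeout (seconds) expires first.
    """
    start = time.monotonic()
    while True:
        bits = read_entropy(path)
        report("entropy available: %d bits" % bits)
        if bits >= threshold:
            return True
        if timeout is not None and time.monotonic() - start >= timeout:
            return False
        time.sleep(interval)
```

Calling wait_for_entropy(128, timeout=60) would print the estimate once a second and give up after a minute. The 128 here is just the figure mentioned above, not a recommendation.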

It is worse with inst.ks=file, which will hang nearly forever. Moving it
to run in the initqueue with wwoods' patch helps: it eventually runs, so
that's an improvement. inst.ks=http will eventually run, but it may take
minutes (I've seen as long as 5 in non-exhaustive timing tests).

I tried reducing the random.py value from 2500 back to 32. This was not
a quick fix; it still delays. I also tried skipping the initialization
by setting a=None, and that made no observable difference. This probably
means that there is more to this than just the python 3.5 change.

I tried adding qemu and qemu-net to the initrd with lorax, thinking that
virtio-rng would magically start feeding entropy from the host. It
didn't appear to help, but also doesn't hurt. Maybe rngd is needed? Not
sure how well that would work with running in the initrd.

I'm not sure where to go next with this. Reducing or removing the random.py
init and still having it get stuck would seem to indicate that something
else is involved, but David Shea has straced it and it is getting stuck
in the python read from /dev/urandom.

Also, things boot normally if inst.ks isn't passed; pool initialization
happens at about 14 seconds.
--
Brian C. Lane | Anaconda Team | IRC: bcl #anaconda | Port Orchard, WA (PST8PDT)
John Reiser
2015-11-24 15:11:10 UTC
Post by Brian C. Lane
Trying to write down some of the things I know so that I don't forget
and so that maybe someone else can figure this out.
I've been testing with a PXE setup using rawhide boot.iso's that I've
created. I'm injecting the modified code into the initrd by adding
,/path/to/extra.img to the initrd PXE entry. This also works for adding
the kickstart at /
If parse-kickstart is run before the kernel initializes the nonblocking
random pool it appears to starve the system and it takes anywhere from
several minutes to forever for the pool to be initialized and for the
boot to continue.
[[snip]]
How many execve()? Each consumes 16 bytes from the pool for AT_RANDOM.
A shell script that invokes helper utility programs instead of builtin
string operations is one likely culprit.
Brian C. Lane
2015-11-24 16:26:58 UTC
Post by John Reiser
Post by Brian C. Lane
Trying to write down some of the things I know so that I don't forget
and so that maybe someone else can figure this out.
I've been testing with a PXE setup using rawhide boot.iso's that I've
created. I'm injecting the modified code into the initrd by adding
,/path/to/extra.img to the initrd PXE entry. This also works for adding
the kickstart at /
If parse-kickstart is run before the kernel initializes the nonblocking
random pool it appears to starve the system and it takes anywhere from
several minutes to forever for the pool to be initialized and for the
boot to continue.
[[snip]]
How many execve()? Each consumes 16 bytes from the pool for AT_RANDOM.
A shell script that invokes helper utility programs instead of builtin
string operations is one likely culprit.
Unknown, but presumably way more than 1 since it is dracut running at this point.
--
Brian C. Lane | Anaconda Team | IRC: bcl #anaconda | Port Orchard, WA (PST8PDT)
John Reiser
2015-11-25 01:33:29 UTC
[[snip]]
Post by John Reiser
Post by Brian C. Lane
If parse-kickstart is run before the kernel initializes the nonblocking
random pool it appears to starve the system and it takes anywhere from
several minutes to forever for the pool to be initialized and for the
boot to continue.
[[snip]]
How many execve()? Each consumes 16 bytes from the pool for AT_RANDOM.
A shell script that invokes helper utility programs instead of builtin
string operations is one likely culprit.
Unknown, but presumably way more than 1 since it is dracut running at this point.
Some pieces of dracut have been written to avoid execve(). [I did some work for
40network/module-setup.sh.] Other pieces have not. As you have discovered,
the result is a voracious appetite for entropy for AT_RANDOM.

One palliative is to fork() a process which uses its own AT_RANDOM
to seed a pseudo-random number generator. Then call the generator
periodically, and "donate" some of the results to the entropy pool.
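
A rough sketch of the "donate" step, in Python, under several assumptions. The seed is passed in by the caller; fetching the AT_RANDOM bytes (for example through getauxval(AT_RANDOM)) is left out, and the function name and parameters are made up for illustration. One caveat: an ordinary write to /dev/urandom mixes the data into the pool but does not raise the kernel's entropy count. Crediting entropy takes the RNDADDENTROPY ioctl and root privilege, so this alone may not unblock a starved reader.

```python
import random
import time


def donate_periodically(seed, rounds, interval=1.0, nbytes=64,
                        pool_path="/dev/urandom"):
    """Write PRNG output to pool_path `rounds` times, `interval` seconds apart.

    seed      -- value used to seed the generator (int, bytes or str)
    pool_path -- device to write to; overridable so the sketch can be tested
    Returns the total number of bytes written.
    """
    prng = random.Random(seed)  # Mersenne Twister, not a cryptographic generator
    total = 0
    with open(pool_path, "wb", buffering=0) as pool:
        for i in range(rounds):
            chunk = bytes(prng.getrandbits(8) for _ in range(nbytes))
            total += pool.write(chunk)
            if i + 1 < rounds:
                time.sleep(interval)
    return total
```

Run from a forked child, something like donate_periodically(seed, rounds=60) would feed 64 bytes a second for a minute without blocking the parent.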

Another possibility would be to rewrite dracut in python. Builtin
and library functions probably could reduce execve() by a factor.
Replace many of the dracut shell's execve() with a call to a python
function which avoids execve().
John Reiser
2015-11-25 12:59:05 UTC
Post by John Reiser
Some pieces of dracut have been written to avoid execve(). [I did some work for
40network/module-setup.sh.] Other pieces have not. As you have discovered,
the result is a voracious appetite for entropy for AT_RANDOM.
In 2011 I implemented a feature which correlates system calls from bash
with the function name and line number of the current executing shell script.
This allowed me to locate the portions of the script that used execve,
then rewrite them to avoid execve where possible.

bash-syspose uses the LD_PRELOAD feature of the GNU C library (glibc), the PS4
feature of the GNU bash shell, and a compatible two-line tweak to the source of
bash itself, to trace a bash shell script by system call. An example output line
on stderr is:

0.230123 ***@6 < dracut-functions:646 16223:execve("/bin/uname", ["uname", "-m"], 60 vars) = 0

where
  0.230123                        elapsed time in seconds since start of tracing
  ***@6                           current executing function name and line number in script file
  dracut-functions:646            file name and line number of caller of current function
  16223                           current process PID
  execve("/bin/uname", ... ) = 0  syscall(arguments) = result [similar to strace]

bash-syspose is implemented as a small (500 lines) C-code shared library which
intercepts (via LD_PRELOAD) selected system calls that bash makes. During
interception, bash-syspose evaluates the PS4 prompt and uses the result as the main
part of tracing output. A two-line tweak to the bash source makes evaluating PS4
essentially transparent to the rest of bash.
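
To turn such a trace into a list of offenders, something like the following Python sketch could tally execve calls per calling script line. The regular expression is inferred from the single sample line above, so real output may need adjustments; the names are illustrative, and the 16 bytes per execve is the AT_RANDOM figure from earlier in this thread.

```python
import re
from collections import Counter

# Field layout assumed from the one sample line shown above.
TRACE_RE = re.compile(
    r'^(?P<elapsed>\d+\.\d+)\s+'
    r'(?P<func>\S+)@(?P<line>\d+)\s+<\s+'
    r'(?P<caller>\S+):(?P<caller_line>\d+)\s+'
    r'(?P<pid>\d+):(?P<syscall>\w+)\((?P<args>.*)\)\s+=\s+(?P<result>-?\d+)'
)


def execve_cost(lines, bytes_per_exec=16):
    """Count execve calls per (caller file, caller line).

    Returns (counts, total_bytes), where total_bytes is the number of
    execve calls times bytes_per_exec. Lines that do not match the
    assumed format are silently skipped.
    """
    counts = Counter()
    for line in lines:
        m = TRACE_RE.match(line.strip())
        if m and m.group("syscall") == "execve":
            counts[(m.group("caller"), int(m.group("caller_line")))] += 1
    return counts, sum(counts.values()) * bytes_per_exec
```

Feeding it the captured stderr, for example execve_cost(open("trace.log")), and then looking at counts.most_common(10) should point at the script lines worth rewriting first. The file name trace.log is only a placeholder.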
Alexander Todorov
2015-11-26 08:31:08 UTC
Post by John Reiser
bash-syspose is implemented as a small (500 lines) C-code shared library which
Can you send a link to the source, please?

--
Alex
John Reiser
2015-11-26 15:36:18 UTC
Post by Alexander Todorov
Post by John Reiser
bash-syspose is implemented as a small (500 lines) C-code shared library which
Can you send a link to the source, please?
http://bitwagon.com/bash-syspose-0.8.tgz
