Discussion:
NetBSD speed curiosities
John Klos
2014-04-16 19:03:14 UTC
Hi,

I'm still puzzled by one thing that perhaps someone could explain. The
Dhrystone benchmark from pkgsrc should be pretty consistent due to its
simplicity. However, my 4000/30 (VLC) with NetBSD 6, which by many sources
is supposed to be 5 VUPs or so, only reports 2109 runs per second, which
is only about 1.2 times faster than the runs per second of an 11/780.

A 4000/60, which is supposed to be about 12 VUPs, gives 16305 runs/sec,
which is about 9.2 times faster than an 11/780.

Why is this? Is there some issue which makes code too large to run inside
of the CVAX SOC's 6k of cache? Is anyone running NetBSD 1.5 or another old
version who could get some more numbers?

John
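
The ratios quoted in this post can be reproduced with a short Python sketch. It assumes a baseline of 1757 Dhrystones per second for the 11/780, which is the commonly cited reference figure; the post does not say which baseline was actually used, so treat that constant as an assumption.

```python
# Assumed baseline: the commonly cited VAX 11/780 figure of 1757 Dhrystones/s.
# The post does not state its baseline; this value reproduces the quoted ratios.
VAX_11_780_DHRYSTONES_PER_SEC = 1757

def relative_to_780(runs_per_sec):
    """Return the result as a multiple of the assumed 11/780 baseline."""
    return runs_per_sec / VAX_11_780_DHRYSTONES_PER_SEC

# Runs per second reported in the post for NetBSD 6.
measured = {"4000/30 (VLC)": 2109, "4000/60": 16305}

for machine, runs in measured.items():
    print(f"{machine}: {relative_to_780(runs):.2f}x an 11/780")
```

Under that assumption the VLC comes out at roughly 1.2 times an 11/780 and the 4000/60 at a little over 9 times, in line with the figures above.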
Dave McGuire
2014-04-16 19:27:35 UTC
Post by John Klos
I'm still puzzled by one thing that perhaps someone could explain. The
Dhrystone benchmark from pkgsrc should be pretty consistent due to its
simplicity. However, my 4000/30 (VLC) with NetBSD 6, which by many
sources is supposed to be 5 VUPs or so, only reports 2109 runs per
second, which is only about 1.2 times faster than the runs per second of
an 11/780.
A 4000/60, which is supposed to be about 12 VUPs, gives 16305 runs/sec,
which is about 9.2 times faster than an 11/780.
Why is this? Is there some issue which makes code too large to run
inside of the CVAX SOC's 6k of cache? Is anyone running NetBSD 1.5 or
another old version who could get some more numbers?
I would immediately suspect caching issues here. Have you looked at
the assembler output?

-Dave
--
Dave McGuire, AK4HZ
New Kensington, PA
John Klos
2014-04-16 20:30:04 UTC
Post by Dave McGuire
I would immediately suspect caching issues here. Have you looked at
the assembler output?
No, I haven't. I know next to nothing about VAX assembly, so I doubt I
know enough to spot anything anomalous, but I'll take a look.

John
Toby Thain
2014-04-17 01:03:03 UTC
Post by Dave McGuire
Post by John Klos
I'm still puzzled by one thing that perhaps someone could explain. The
Dhrystone benchmark from pkgsrc should be pretty consistent due to its
simplicity. However, my 4000/30 (VLC) with NetBSD 6, which by many
sources is supposed to be 5 VUPs or so, only reports 2109 runs per
second, which is only about 1.2 times faster than the runs per second of
an 11/780.
A 4000/60, which is supposed to be about 12 VUPs, gives 16305 runs/sec,
which is about 9.2 times faster than an 11/780.
Maybe it's me, but these ratios seem well within the fuzziness I'd
expect given all the variables (including, as Dave sagely noted, cache).
A VUP is an archetypically vague unit.

--T
Post by Dave McGuire
Post by John Klos
Why is this? Is there some issue which makes code too large to run
inside of the CVAX SOC's 6k of cache? Is anyone running NetBSD 1.5 or
another old version who could get some more numbers?
I would immediately suspect caching issues here. Have you looked at
the assembler output?
-Dave
Dave McGuire
2014-04-17 01:06:10 UTC
Post by Toby Thain
Post by John Klos
I'm still puzzled by one thing that perhaps someone could explain. The
Dhrystone benchmark from pkgsrc should be pretty consistent due to its
simplicity. However, my 4000/30 (VLC) with NetBSD 6, which by many
sources is supposed to be 5 VUPs or so, only reports 2109 runs per
second, which is only about 1.2 times faster than the runs per second of
an 11/780.
A 4000/60, which is supposed to be about 12 VUPs, gives 16305 runs/sec,
which is about 9.2 times faster than an 11/780.
Maybe it's me, but these ratios seem well within the fuzziness I'd
expect given all the variables (including, as Dave sagely noted, cache).
A VUP is an archetypically vague unit.
Well, yes and no... A 4000/VLC easily runs 3-4x faster than a
MicroVAX-II, side-by-side. It's fuzzy, yes, but not, I think, THAT fuzzy.

-Dave
--
Dave McGuire, AK4HZ
New Kensington, PA
John Klos
2014-04-17 01:24:33 UTC
Maybe it's me, but these ratios seem well within the fuzziness I'd expect
given all the variables (including, as Dave sagely noted, cache). A VUP is an
archetypically vague unit.
20% might be fuzzy, but a factor of four isn't really fuzzy. An 11/780
runs at 5 MHz, and a 4000/30 at 25 MHz. Even assuming no improvements in
clocks-per-instruction between the two, the results don't make sense.

The real questions are why Digital got 19105 dhrystones a second on the
VLC and 30120 on the 4000/60, but NetBSD gets 2109 (11%) and 16305 (54%)
of DEC speeds, and why the shortfall differs so much between the two
machines.

John
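
As a rough check on the percentages in this post, the sketch below recomputes them from the four quoted figures and also prints the absolute gap in runs per second. The function name `shortfall` is purely illustrative, and nothing here goes beyond arithmetic on the numbers already given.

```python
# Figures quoted in the post, in Dhrystones/s: (DEC-published, measured on NetBSD).
results = {
    "4000/30 (VLC)": (19105, 2109),
    "4000/60": (30120, 16305),
}

def shortfall(dec_rate, netbsd_rate):
    """Return (fraction of DEC's figure achieved, runs/s missing)."""
    return netbsd_rate / dec_rate, dec_rate - netbsd_rate

for machine, (dec, netbsd) in results.items():
    fraction, missing = shortfall(dec, netbsd)
    print(f"{machine}: {fraction:.0%} of DEC's figure, {missing} runs/s short")
```

The relative results differ widely (about 11% versus 54%), while the absolute gaps are of similar size (16996 and 13815 runs per second). That pattern is what a roughly constant per-second cost would produce, though this is only one possible reading and not a diagnosis.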
Matt Thomas
2014-04-16 21:04:12 UTC
Hi,
I'm still puzzled by one thing that perhaps someone could explain. The Dhrystone benchmark from pkgsrc should be pretty consistent due to its simplicity. However, my 4000/30 (VLC) with NetBSD 6, which by many sources is supposed to be 5 VUPs or so, only reports 2109 runs per second, which is only about 1.2 times faster than the runs per second of an 11/780.
A 4000/60, which is supposed to be about 12 VUPs, gives 16305 runs/sec, which is about 9.2 times faster than an 11/780.
Why is this? Is there some issue which makes code too large to run inside of the CVAX SOC's 6k of cache? Is anyone running NetBSD 1.5 or another old version who could get some more numbers?
Try an -Os version and see if it's different.
John Klos
2014-04-16 21:39:29 UTC
Post by Matt Thomas
Try an -Os version and see if it's different.
I tried -O0, -Os, -O2, -O4, -fno-unroll-loops - all are within 15% of one
another in size (about 11,000 bytes) and all are within 2% of one another
with regard to speed.

John
Toby Thain
2014-04-17 01:07:08 UTC
Post by John Klos
Hi,
I'm still puzzled by one thing that perhaps someone could explain. The
Dhrystone benchmark from pkgsrc should be pretty consistent due to its
Oh, also, DHRYSTONE IS A TERRIBLE AWFUL BENCHMARK, and mostly
discredited. Its problem is that it is TOO simple, i.e. wildly subject
to the behaviour of particular operations rather than a good mix, and
easily gamed (many vendors did).

Please find a better integer benchmark (I came to like fhourstones, but
I am not sure how much of a better mix that represents). Most hardware
benchmark papers publish at least half a dozen different benchmarks in
order not to be skewed by Dhrystone-type effects.

--Toby
Post by John Klos
simplicity. However, my 4000/30 (VLC) with NetBSD 6, which by many
sources is supposed to be 5 VUPs or so, only reports 2109 runs per
second, which is only about 1.2 times faster than the runs per second of
an 11/780.
A 4000/60, which is supposed to be about 12 VUPs, gives 16305 runs/sec,
which is about 9.2 times faster than an 11/780.
Why is this? Is there some issue which makes code too large to run
inside of the CVAX SOC's 6k of cache? Is anyone running NetBSD 1.5 or
another old version who could get some more numbers?
John
John Klos
2014-04-17 01:27:53 UTC
Hi,
Oh, also, DHRYSTONE IS A TERRIBLE AWFUL BENCHMARK, and mostly discredited.
Its problem is that it is TOO simple, i.e. wildly subject to the behaviour of
particular operations rather than a good mix, and easily gamed (many vendors
did).
Please find a better integer benchmark (I came to like fhourstones, but I am
not sure how much of a better mix that represents). Most hardware benchmark
papers publish at least half a dozen different benchmarks in order not to be
skewed by Dhrystone-type effects.
It's not about benchmarking different systems - it's about benchmarking
the exact same system with different operating systems or different
versions of the same operating system.

This is precisely where we want something that's too simple - we want to
look at why this simple bit of C code is a fraction of the speed
under modern NetBSD.

John
Toby Thain
2014-04-17 02:07:33 UTC
Post by John Klos
Hi,
Post by Toby Thain
Oh, also, DHRYSTONE IS A TERRIBLE AWFUL BENCHMARK, and mostly
discredited. Its problem is that it is TOO simple, i.e. wildly subject
to the behaviour of particular operations rather than a good mix, and
easily gamed (many vendors did).
Please find a better integer benchmark (I came to like fhourstones,
but I am not sure how much of a better mix that represents). Most
hardware benchmark papers publish at least half a dozen different
benchmarks in order not to be skewed by Dhrystone-type effects.
It's not about benchmarking different systems - it's about benchmarking
the exact same system with different operating systems or different
versions of the same operating system.
This is precisely where we want something that's too simple - we want to
look at why this simple bit of C code is a fraction of the speed under
modern NetBSD.
Hm, okay, I see your point. On the *same system* it could serve that
purpose.

Be very wary though - iirc one of the contemporary criticisms (on Usenet
maybe?) was that it tested unspecified library code rather than its own
code. If you suspect a library regression why not test that directly?

--Toby
Post by John Klos
John
David Brownlee
2014-04-19 20:22:43 UTC
Hi,
Oh, also, DHRYSTONE IS A TERRIBLE AWFUL BENCHMARK, and mostly discredited.
Its problem is that it is TOO simple, i.e. wildly subject to the behaviour
of particular operations rather than a good mix, and easily gamed (many
vendors did).
Please find a better integer benchmark (I came to like fhourstones, but I
am not sure how much of a better mix that represents). Most hardware
benchmark papers publish at least half a dozen different benchmarks in order
not to be skewed by Dhrystone-type effects.
It's not about benchmarking different systems - it's about benchmarking the
exact same system with different operating systems or different versions of
the same operating system.
This is precisely where we want something that's too simple - we want to
look at why this simple bit of C code is a fraction of the speed under
modern NetBSD.
Following up a suggestion from Thor in the other thread - have you
tried linking it statically and comparing the numbers?

Another interesting number might come from building a static binary
under NetBSD 1.2 in simh and then testing it on your main machine.
James K. Lowden
2014-04-21 15:16:29 UTC
On Sat, 19 Apr 2014 21:22:43 +0100
Post by David Brownlee
Following up a suggestion from Thor in the other thread - have you
tried linking it statically and comparing the numbers?
Another interesting number might come from building a static binary
under NetBSD 1.2 in simh and then testing it on your main machine.
Just as a data point, it isn't necessary to go back to 1.2 for decent
performance on VAX.

I ran NetBSD 1.6 when it was current on a 4000/60 with no trouble. I
brought it to the Linux Expo in NYC one year, where it ran dillo as a
demonstration.

No custom kernel, just whatever binaries I found on ftp. I didn't know
how to cross-compile, so I probably built any applications natively,
relying on the defaults, with pkgsrc.

--jkl
John Klos
2014-04-21 17:21:13 UTC
Suddenly something is added that makes hardclock take 5000 instructions
instead. On the 1MIPS machine we now only have 50% of the CPU left, but
on the newer machine we still have 99.9% of the CPU left (== not
noticeable). Something like this can only be seen on a really slow
machine, it's most likely not possible to find it even with fine-grain
profiling on a modern PC.
This would explain the disparities I'm seeing between the 4000/30 and
the 4000/60:
The real questions are why Digital got 19105 dhrystones a second on the
VLC and 30120 on the 4000/60, but NetBSD gets 2109 (11%) and 16305 (54%)
of DEC speeds, and why the shortfall differs so much between the two
machines.
I do have a 4000/90a, but haven't been running it while waiting for some
replacement SIMM sockets (some pins have become corroded). Three data
points on one version of the OS isn't much, so I'm curious about what
others have.

Reducing overhead doesn't just help the VAX port - I'm building embedded
products and one of the platforms may be a 400 MHz MIPS system with 16
megs of memory, in which case every little bit counts.

John
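
The hardclock example quoted at the top of this post can be worked through as a small model. This is a sketch under stated assumptions: the clock interrupt rate of 100 ticks per second and the 500 MIPS figure for the faster machine are not given in the thread; they are the values that make the quoted 50% and 99.9% come out, and `cpu_left` is an illustrative name.

```python
def cpu_left(instructions_per_tick, ticks_per_sec, instructions_per_sec):
    """Fraction of the CPU remaining after a fixed per-tick interrupt cost."""
    overhead = instructions_per_tick * ticks_per_sec
    return max(0.0, 1.0 - overhead / instructions_per_sec)

HZ = 100  # assumed clock interrupt rate; not stated in the thread

# 5000 instructions per tick, as in the quoted example.
slow = cpu_left(5000, HZ, 1_000_000)     # the 1 MIPS machine
fast = cpu_left(5000, HZ, 500_000_000)   # hypothetical 500 MIPS machine

print(f"1 MIPS: {slow:.1%} left, 500 MIPS: {fast:.1%} left")
```

With these assumptions the per-tick cost adds up to 500,000 instructions per second: half of the slow machine's capacity, but only a thousandth of the fast one's. The same fixed cost therefore hurts a slower machine far more, which fits the direction of the 4000/30 versus 4000/60 results without proving the cause.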

emanuel stiebler
2014-04-17 05:55:28 UTC
Post by John Klos
Hi,
Why is this? Is there some issue which makes code too large to run
inside of the CVAX SOC's 6k of cache? Is anyone running NetBSD 1.5 or
another old version who could get some more numbers?
Either something is terribly wrong with the code, or your cache is
switched off.
Any chance of checking the diagnostics on the machine?
Are there any messages during boot saying that it is disabled?