18-MAR-2024
After that blown cap, there have been no further incidents. I put the memory back in and switch the system on. The diagnostic display on the front panel goes from FF to FD and stops.
DEC have traditionally used 8-bit codes to indicate the power-on sequence progress. On most systems they were displayed on eight red or amber LEDs at the back. In the DEC 3000 AXP family, however, the higher end models 500/800/900 are equipped with a rather fancy Lights and Switch Module (LSM), part no. 54–21145–02 (or –01 in rack mount variants), with a two digit hexadecimal LED display. The hex digit indicators look like TIL311 or DIS1417.
FD means ‘no memory found’. The advice given in the manual is to re-seat the SIMMs. Which is what I’ve already done. As I then read on the internet, mere re-seating is often not enough and the contacts and connectors should be properly cleaned with isopropyl alcohol.
19-MAR-2024
I popped into my local DIY shop for some IPA (isopropyl alcohol, that is) but found none. Just as I was about to leave, I spotted a bottle of sanitiser on sale for £1 (cash only!). From what I remember, IPA is a common ingredient in sanitisers — worth a shot, for a quid anyway.
I pull out all the memory motherboards (MMBs) and SIMMs and give them a thorough clean with a cotton swab soaked in sanitiser. This doesn’t help much, and the machine is again stuck at FD.
At this stage I need access to the diagnostic serial port on the system board to find out what’s going on.
The Alpha chip uses just three pins to bootstrap itself. They are connected to a serial ROM (SROM). When the processor goes out of reset, the contents of the SROM are loaded directly into the instruction cache and then executed from there. This code then configures the memory and caches, loads the main System ROM from Flash into memory, and transfers control to it.
Once the contents of the serial ROM have been loaded into the instruction cache, the clock and data signals for the SROM become simple I/O pins. The SROM code uses them to implement a bit-bashed serial interface at 9600 or 19 200 baud, because the normal serial ports are provided by the Serial Communications Controller, which is not yet available at this early stage. The buffered signals are routed to the 2x5-pin connector J11 on the system board, where they can be used with any TTL level (5V-tolerant) USB–UART adapter cable or similar device for early power-on diagnostics.
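To make the bit-bashed serial link a little more concrete, here is a minimal Python sketch of how a byte can be serialised onto a single I/O pin. It assumes the usual 8N1 framing (one start bit, eight data bits sent LSB first, one stop bit), which I have not seen confirmed for the SROM code. The names `bit_period_us`, `frame_byte` and `unframe` are made up for this illustration.

```python
def bit_period_us(baud):
    """Time each bit is held on the line, in microseconds."""
    return 1_000_000 / baud


def frame_byte(value):
    """Line levels for one byte: start bit (0), 8 data bits LSB first, stop bit (1).

    8N1 framing is an assumption, not something taken from DEC documentation.
    """
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in one byte")
    data_bits = [(value >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]


def unframe(levels):
    """Inverse of frame_byte: check start/stop bits and rebuild the byte."""
    if len(levels) != 10 or levels[0] != 0 or levels[9] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(levels[1:9]))


print(round(bit_period_us(9600), 1))   # 104.2
print(frame_byte(ord("S")))
```

A software UART simply holds each of those levels on the pin for one bit period, which is why the two former SROM pins are enough to drive it.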
Two pins at the bottom are connected to the 3.45 V regulator module (probably for voltage monitoring). The pinout isn’t really documented anywhere. A variant of this port used on the AlphaStation 200 is somewhat documented, but there is no guarantee it is the same on the DEC 3000 AXP. Besides, the pin numbering in that document is unusual and I can’t work out which pin is where.
After a bit of poking about I discover the pinout.
Signal | Pin | Pin | Signal |
---|---|---|---|
GND | 5 | 10 | SROMCDAT/RX |
GND | 4 | 9 | BSROMCLK/TX |
+12 V | 3 | 8 | +5 V |
shorted across to pin 7 | 2 | 7 | shorted across to pin 2 |
+3.45 V | 1 | 6 | GND |
Since I don’t want to disturb anything related to the power supply, I leave the existing socket connector of the 3.45 V regulator in place. The remaining eight holes in it are empty, and I put four regular jumper wires through them and onto pins 4, 8, 9, and 10 of the J11 connector.
The machine shouldn’t be powered up for any substantial amount of time with the side panel removed.
Luckily, my 20 cm jumper leads are just about long enough to make it through one of the holes in the top of the chassis, so I can put the side panel back in place.
Time to turn on the power and see if there is anything SROM can tell us.
DEC 3000 - M800 SROM 6.1 Powerup Sequence
ff.fd.
Seq/PC fd000000.000017a0
*** No usable memory detected ***
Default Mem Cfg: Banks 0 and 6 = 8MB, both mapped to addr 0.
MCRstat 11411111.11151145
bnkSize 00000000.00000000
memSize 00000000.00000000
SROM>
Yay! This is the first real sign of life from this system. At least the processor is alive. The memory is not, though, and the SROM code stops at FD and jumps to its ‘miniconsole’ showing the SROM> prompt.
Like the pinout of its diagnostic connector, the miniconsole isn’t documented either. I found something on the internet which describes a much later SROM for a different system. A few commands work but most don’t, including ‘mt’ for memory test.
20-MAR-2024
The “serial” ROM in this system is actually implemented on an 8-bit parallel UV-erasable programmable ROM chip 27C512. It is organised as 8 jumper-selectable bit streams. Stream 0 contains the normal boot code; the others are ‘for manufacturing use’.
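For anyone wanting to look at these images offline, a dump of the EPROM can be split into its eight streams with a few lines of Python. This is only a sketch under two assumptions of mine: that stream n sits on data bit n of every byte, and that successive serial bits are packed LSB first. Neither is documented here, and `extract_stream` is a hypothetical helper name.

```python
def extract_stream(dump, stream):
    """Collect data bit `stream` from every EPROM byte and pack the bits into bytes.

    Assumptions (not verified against real hardware): stream n lives on
    data bit n, and consecutive serial bits are packed LSB first.
    """
    if not 0 <= stream <= 7:
        raise ValueError("an 8-bit EPROM offers streams 0..7 only")
    out = bytearray()
    usable = len(dump) - len(dump) % 8      # ignore a trailing partial group
    for base in range(0, usable, 8):
        byte = 0
        for i in range(8):
            bit = (dump[base + i] >> stream) & 1
            byte |= bit << i
        out.append(byte)
    return bytes(out)


# A 27C512 holds 64K x 8 bits, so each of the eight streams is 64 Kbit = 8 KiB.
blank = bytes(65536)
print(len(extract_stream(blank, 0)))        # 8192
```

If the real bit order turns out to be MSB first, only the `bit << i` line needs changing.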
I shift the jumper through the other positions to check what else is in this SROM. Unfortunately, these systems do not come with a RESET button; instead, there’s the HALT button on the front panel, but it doesn’t do anything until the console software has been loaded from the System ROM (aka Flash). So it’s a power cycle every time I want to run another SROM image. Here’s what I’ve found:
Image | Jumper | Description |
---|---|---|
0 | J8 | Powerup Sequence |
1 | J7 | Mini-Console at 19200 baud |
2 | J6 | Mini-Console at 9600 baud |
3 | J5 | Cache Test (longword) |
4 | J4 | Mfg Test – bctest |
5 | J3 | Empty (no output) |
6 | J2 | LongWord Memory test (no cache) |
7 | J1 | LongWord Memory test (cache on) |
The baud rate is 9600 for every image except image 1. ‘bctest’ presumably refers to the 2 MB write-back backup cache, or Bcache, located on the System Module.
Image 6 looks like what I’m after. I put the jumper on J2 and switch the system on for a closer look. The test displays F0 on the front panel and continuously spews out memory errors:
DEC 3000 - M800 SROM 6.1 Mfg Test
ff.fd.
Seq/PC fd000000.00001388
*** No usable memory detected ***
Default Mem Cfg: Banks 0 and 6 = 8MB, both mapped to addr 0.
MCRstat 11411111.11151145
bnkSize 00000000.00000000
memSize 00000000.00000000
fb.f0.
MCRstat 11411111.11151145
bnkSize 00000000.00000000
memSize 00000000.00000008
memTest (no-cache) LongWord Memory Test
address:407ffdec wrote:ffffffff read:00000000
address:407ffde8 wrote:ffffffff read:00000000
address:407ffdd4 wrote:ffffffff read:00800000
address:407ffdd0 wrote:ffffffff read:00800000
address:407ffdcc wrote:ffffffff read:00800000
address:407ffdc8 wrote:ffffffff read:00800000
address:407ffdb4 wrote:ffffffff read:bb1c44cc
address:407ffdb0 wrote:ffffffff read:bb1c44cc
address:407ffdac wrote:ffffffff read:00800000
address:407ffda8 wrote:ffffffff read:00800000
address:407ffd94 wrote:ffffffff read:00000000
address:407ffd90 wrote:ffffffff read:00000000
address:407ffd8c wrote:ffffffff read:00800000
address:407ffd88 wrote:ffffffff read:00800000
address:407ffd74 wrote:ffffffff read:44e33b73
address:407ffd70 wrote:ffffffff read:44e33b73
address:407ffd6c wrote:ffffffff read:00800000
address:407ffd68 wrote:ffffffff read:00800000
(and so on).
This looks like a lot of errors, but there is a pattern to those faulty addresses. Leaving aside the 407ffd prefix for the moment and looking at the last two digits in binary, here’s what we get:
Bad | Good |
---|---|
ec = 1110 1100 | |
e8 = 1110 1000 | |
 | e4 = 1110 0100 |
 | e0 = 1110 0000 |
 | dc = 1101 1100 |
 | d8 = 1101 1000 |
d4 = 1101 0100 | |
d0 = 1101 0000 | |
cc = 1100 1100 | |
c8 = 1100 1000 | |
 | c4 = 1100 0100 |
 | c0 = 1100 0000 |
 | bc = 1011 1100 |
 | b8 = 1011 1000 |
b4 = 1011 0100 | |
b0 = 1011 0000 | |
Half of the memory is faulty! And it’s spread across the data bus so that no memory can be used. Let’s see which address bits differentiate good memory from bad:
xxx01x00 – bad
xxx00x00 – good
xxx11x00 – good
xxx10x00 – bad
(The ‘x’ bits can be either 0 or 1.)
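The same check can be done mechanically. The short Python sketch below takes the low bytes of the addresses that appear in the error log, plus the neighbouring longword addresses that do not, and extracts bits 4 and 3 from each. The list names and the helper `bits_4_3` are just labels I chose for the illustration.

```python
# Low byte of addresses the memory test reported as failing.
BAD = [0xEC, 0xE8, 0xD4, 0xD0, 0xCC, 0xC8, 0xB4, 0xB0]
# Longword addresses in the same range that never show up in the log.
GOOD = [0xE4, 0xE0, 0xDC, 0xD8, 0xC4, 0xC0, 0xBC, 0xB8]


def bits_4_3(addr):
    """Return address bits 4 and 3 as a two-bit number."""
    return (addr >> 3) & 0b11


bad_patterns = sorted({bits_4_3(a) for a in BAD})
good_patterns = sorted({bits_4_3(a) for a in GOOD})

print([format(p, "02b") for p in bad_patterns])    # ['01', '10']
print([format(p, "02b") for p in good_patterns])   # ['00', '11']
```

Put another way, an address in this sample is bad exactly when bits 4 and 3 differ.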
This pattern persists throughout the output of the memory test, so we seem to be onto something. Let’s review the memory organisation in the Model 800.
The memory bus width is 256 data bits (plus 56 ECC bits). Within a bank, 8 SIMMs are arranged in parallel to form an 8 × 32 = 256-bit data bus. This means that the five least significant bits of an address select one of the 32 byte lanes on the bus, and the higher-order bits select a bank and form the address that goes to all the SIMMs in that bank. In all likelihood, the address is partitioned like this:
Address bits | 29 … 5 | 4 3 2 | 1 0 |
---|---|---|---|
Function | Bank selector and SIMM address | SIMM selector in bank | Byte lane in SIMM |
The memory test operates on four-byte longwords, so the two least significant bits of all addresses shown above are 00. Looking at our table of good and bad addresses, we can see that bits 4-3-2 suggest that memory modules 0, 1, 6, and 7 are good, and modules 2, 3, 4, and 5 are not quite.
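Here is that partitioning as a small Python sketch. It follows my guessed layout from the table above (bits 29…5, bits 4…2, bits 1…0), so it is only as correct as that guess. `Fields` and `split_address` are names invented for the example.

```python
from collections import namedtuple

Fields = namedtuple("Fields", ["upper", "simm", "lane"])


def split_address(addr):
    """Split a physical address using the assumed layout from the table."""
    lane = addr & 0b11                     # bits 1..0: byte lane within a SIMM
    simm = (addr >> 2) & 0b111             # bits 4..2: SIMM within the bank
    upper = (addr >> 5) & 0x1FFFFFF        # bits 29..5: bank and SIMM address
    return Fields(upper, simm, lane)


# Four of the failing addresses from the memory test output.
failing = [0x407FFDEC, 0x407FFDE8, 0x407FFDD4, 0x407FFDD0]
print(sorted({split_address(a).simm for a in failing}))   # [2, 3, 4, 5]
```

Every failing address has a byte lane of 0 because the test walks memory one longword at a time.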
According to this diagram, the two inwards facing MMBs are somehow out of order. Perhaps those long connectors where MMBs mate with the System Module are still oxidised. After all, they weren’t easy to reach with a cotton swab and sanitiser, especially the sockets on the MMBs. And the sanitiser, as it turned out, doesn’t list any IPA on its label.
21-MAR-2024
I venture again to Wickes in search of isopropyl alcohol. Eventually I spot this WD-40 Specialist Contact Cleaner in the automotive section for £7.80.
Starting with MMB1 on the right hand side, I pull it out and place it on the anti-static mat. Both male and female connectors get drenched in the contact cleaner as I spray it onto every pin and into every receptacle. I also mate them three times to spread the liquid before it has evaporated. The SIMMs and their sockets receive some thorough soaking as well.
Before trying the memory test again, I swap this MMB with its neighbour, which previously showed no errors. In case my cleaning has been unsuccessful, this will help me understand whether the fault is on MMB1 or on the System Module. Here’s what I get now:
DEC 3000 - M800 SROM 6.1 Mfg Test
ff.fd.
Seq/PC fd000000.00001388
*** No usable memory detected ***
Default Mem Cfg: Banks 0 and 6 = 8MB, both mapped to addr 0.
MCRstat 11411111.11151145
bnkSize 00000000.00000000
memSize 00000000.00000000
fb.f0.
MCRstat 11411111.11151145
bnkSize 00000000.00000000
memSize 00000000.00000008
memTest (no-cache) LongWord Memory Test
address:407fbff4 wrote:ffffffff read:00000000
address:407fbff0 wrote:ffffffff read:00000000
address:407fbfd4 wrote:ffffffff read:00000000
address:407fbfd0 wrote:ffffffff read:00000000
address:407fbfb4 wrote:ffffffff read:00000000
address:407fbfb0 wrote:ffffffff read:00000000
address:407fbf94 wrote:ffffffff read:00000000
address:407fbf90 wrote:ffffffff read:00000000
address:407fbf74 wrote:ffffffff read:00000000
address:407fbf70 wrote:ffffffff read:00000000
address:407fbf54 wrote:ffffffff read:00000000
address:407fbf50 wrote:ffffffff read:00000000
address:407fbf34 wrote:ffffffff read:00000000
address:407fbf30 wrote:ffffffff read:00000000
Aha! Firstly, the errors in the 407ff…–407fc… range appear to have gone away, but that’s not important right now. I’ve found what I was hoping for: there are no longer errors at the addresses ending in ec, e8, cc, c8, ac, a8, and so on. These followed the xxx01x00 pattern pointing at SIMM 2 and SIMM 3 on MMB1 on the right hand side, which I’ve just bathed in copious amounts of contact cleaner.
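Comparing the two runs by eye works, but the serial capture can also be fed through a small script. The Python sketch below pulls the addresses out of the `address:… wrote:… read:…` lines and reports which SIMM selector values (bits 4…2, per my assumed address layout) still fail. The sample strings are the first few error lines from each run; `failing_simms` is a name made up for this example.

```python
import re

# Matches one error line as printed by the SROM memory test.
ERROR_LINE = re.compile(
    r"address:([0-9a-f]{8}) wrote:([0-9a-f]{8}) read:([0-9a-f]{8})"
)


def failing_simms(log):
    """Return the sorted SIMM selector values (address bits 4..2) seen in a log."""
    return sorted(
        {(int(m.group(1), 16) >> 2) & 0b111 for m in ERROR_LINE.finditer(log)}
    )


before = (
    "address:407ffdec wrote:ffffffff read:00000000 "
    "address:407ffde8 wrote:ffffffff read:00000000 "
    "address:407ffdd4 wrote:ffffffff read:00800000 "
    "address:407ffdd0 wrote:ffffffff read:00800000"
)
after = (
    "address:407fbff4 wrote:ffffffff read:00000000 "
    "address:407fbff0 wrote:ffffffff read:00000000 "
    "address:407fbfd4 wrote:ffffffff read:00000000 "
    "address:407fbfd0 wrote:ffffffff read:00000000"
)

print(failing_simms(before))   # [2, 3, 4, 5]
print(failing_simms(after))    # [4, 5]
```

Selector values 2 and 3 drop out after the cleaning, which matches the disappearance of the xxx01x00 addresses.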
22-MAR-2024
Okay. Rinse and repeat, now with the MMB on the left hand side:

DEC 3000 - M800 SROM 6.1 Mfg Test
ff.fd.fb.f0.
MCRstat 11111111.11801180
bnkSize 00000200.00000500
memSize 00000040.00000040
memTest (no-cache) LongWord Memory Test
....done.
....done.
....done.
....done.
....done.
Each row of dots takes a while before it resolves into a satisfying ‘done.’ Mesmerising. I could watch this all day.