

## OpenVMS Across Four Architectures

### A Trip Down Memory Lane

Berlin, April 2024

Camiel Vanderhoeven | Chief Architect & Strategist



### VAX, Alpha, Itanium and x86-64 Comparing Architectures

# VmS Software

Camiel Vanderhoeven | SEP-2017





VAX, Alpha, Itanium and x86-64 **Comparing Architectures** 

Camiel Vanderhoeven | SEP-2017



### Introduction to the x86 Architecture

Camiel Vanderhoeven

September 15, 2015



### **Re-architecting SWIS for X86-64** Teaching a Not-So-Old Dog New Tricks

Camiel Vanderhoeven | SEP-2017



Oddities and surprises Interesting Features of the X86-64 Architecture

Camiel Vanderhoeven | SEP-2017



## X86 Heritage

### Where did this architecture come from?

### **X86 Development Timeline**

1978<br/>80861982<br/>2861985<br/>386







### New Design vs. Extensions

32-bit VAX

64-bit Alpha

### 64-bit Itanium



### 32-bit i386

16-bit 8086

286

PM

AMD64 64-bit



## And boy, did this present us with challenges...

### Booting

- Boot Processor put in 64-bit mode by Firmware
- Additional processors halted in 16-bit real mode

Legacy hardware

• Multiple ways to deal with timers, interrupts, etc.



## Processor Modes

### Today, I'd like to be...

### All modes are still there

present in modern processors



### For compatibility reasons, all processor modes going back to the 8086 are still























### 16 and 32 bit modes on X86





### These are not the modes you're looking for.











## **Architecture Documentation**

### What is this chip, and what does it do?

### We need documentation

- lot about the underlying platform
- So, we need documentation
- VAX: VAX-11 Architecture Reference Manual
- Alpha: Alpha AXP Architecture Reference Manual  $\bullet$
- Itanium: Intel Itanium Architecture Software Developer's Manual  $\bullet$
- X86-64: Intel 64 and IA-32 Architecture Software Developer's Manual lacksquare
- Here's what this looks like...



### When you're writing – or porting – an operating system, you need to know a





















### **ISA Extensions**

| Name   | First in            | F |
|--------|---------------------|---|
| x87    | 8086+8087 (1980)    | F |
| PM     | 80286 (1982)        | Ρ |
| IA-32  | 80386 (1985)        | 3 |
| PAE    | Pentium Pro (1995)  | Ρ |
| MMX    | Pentium MMX (1997)  | N |
| 3Dnow! | AMD K6-2 (1998)     | 3 |
| SSE(n) | Pentium III (1999)  | S |
| x86-64 | Opteron (2003)      | 6 |
| VT-x   | Pentium 4 (2005)    | V |
| AMD-V  | Athlon 64 (2006)    | V |
| AES-NI | Westmere (2010)     | A |
| AVX(n) | Sandy Bridge (2011) | A |
| TSX    | Haswell (2013)      | Т |
| MPX    | Skylake (2015)      | N |

### unction

- Floating Point Co-processor
- Protected Mode: Virtual Memory
- 32-bit
- Physical Address Extension
- MultiMedia Extension (Integer SIMD)
- 3D Graphics (Floating Point SIMD)
- Streaming SIMD Extensions (FP SIMD)
- 64-bit
- /irtualization support
- /irtualization support
- Advanced Encryption Standard
- Advanced Vector Extensions (FP SIMD)
- ransactional Synchronization Extension
- Memory Protection Extensions



## Some panned out

### VT-x and AMD-V

metal performance

PCIDs (not on list)

loss from mode changes



### •Help us run on hypervisors with near-bare-

## •Help us overcome part of the performance

## Intel TSX

### Transactional Synchronization Extensions

## The problem with locks

- One of the most difficult things to get right in a multi-threaded application (or OS) is synchronization between threads and processors Simultaneous memory access leads to conflicts (reading partially stale data,
- partial writes)
- Most common mechanism to overcome these issues is locking
- Locking is expensive •
- Even when the lock is free, it is still taken
- Trade-off: simplicity (coarse locking) vs performance (fine-grained locking)





## Intel TSX

- TSX aims to eliminate locking by exploiting the CPU's local data cache.
- During a TSX transaction, writes are **only visible** to the local CPU, they are not written to memory, and are **not seen** by the other CPU's.
- After a TSX transaction, the memory writes that occurred during the transaction become visible to the other CPU's atomically, and are then written to memory.
- During a TSX transaction, the CPU monitors memory writes by other CPU's. If a memory write by another CPU conflicts with a memory write or read from this CPU, the transaction is aborted (the memory writes are thrown away and never seen by anyone).



### Spinlock example

### No TSX:

xorl %ecx, %ecx incl %ecx spin\_lock\_retry: xorl %eax, %eax lock; cmpxchgl %ecx, (lock) jnz spin\_lock\_retry

incl (value)

movl \$0 (lock)

ret

### TSX:

## xbegin retry\_with\_spinlock incl (value) xend

ret

```
retry_with_spinlock:
    xorl %ecx, %ecx
    incl %ecx
spin_lock_retry:
    xorl %eax, %eax
    lock; cmpxchgl %ecx, (lock)
    jnz spin_lock_retry
    incl (value)
    movl $0 (lock)
```

25



## Some Observations (1/2)

- TSX debuted in Haswell, subsequently removed by a microcode update because of a bug, and re-introduced in late Broadwell steppings and Skylake.
- With TSX (either HLE or RTM), locking becomes very fine-grained when there are no conflicts. Fallback traditional locks can therefore be less fine-grained without too much of an impact on performance.
- Transactions can be nested (but abortions are passed up).
- Granularity of conflict detection is **implementation specific**, and probably equal to the size of a cache line (64 bytes in Skylake).
- There is an implementation specific limit to the number of cache lines that can be involved in a transaction (512 cache lines in Skylake). If this limit is exceeded, the transaction will abort. Limited associativity of cache lines may cause an abort even before this limit is reached.



## Some Observations (2/2)

- A context switch will also lead to aborted transactions.
- The risk of an abortion grows with the transaction size, even without any conflicts
- Therefore, fallback code must always be provided. This code needs to use a traditional lock, in case more than one thread needs to fall back.
- When an abortion happens in RTM, the fallback code is informed of the reason for the abortion, this can be a conflict, an explicit XABORT instruction, a lack of resources, the execution of an incompatible be).



instruction, or the occurrence of an uncommon event (whatever that may





### **Possible Uses in OpenVMS**

- I/O locks  $\bullet$
- Scheduling lock
- Interlocked queue operations •







### Prototyped queue instructions

 Better performance •Ran in callers' mode

And then TSX was cancelled and removed from chips (again!)



# Register Set





## VAX Register Set

| R0     |
|--------|
| R1     |
| R2     |
| R3     |
| R4     |
| R5     |
| R6     |
| R7     |
| R8     |
| R9     |
| R10    |
| R11    |
| AP/R12 |
| FP/R13 |
| SP/R14 |
| PC/R15 |
|        |

| PSL   |
|-------|
| IPR's |

## Alpha Register Set

| R0  | R16    |
|-----|--------|
| R1  | R17    |
| R2  | R18    |
| R3  | R19    |
| R4  | R20    |
| R5  | R21    |
| R6  | R22    |
| R7  | R23    |
| R8  | R24    |
| R9  | AI/R25 |
| R10 | RA/R26 |
| R11 | PV/R27 |
| R12 | R28    |
| R13 | FP/R29 |
| R14 | SP/R30 |
| R15 | RZ/R31 |
|     |        |

PC PS IPR's

| F0  | F16 |
|-----|-----|
| F1  | F17 |
| F2  | F18 |
| F3  | F19 |
| F4  | F20 |
| F5  | F21 |
| F6  | F22 |
| F7  | F23 |
| F8  | F24 |
| F9  | F25 |
| F10 | F26 |
| F11 | F27 |
| F12 | F28 |
| F13 | F29 |
| F14 | F30 |
| F15 | F31 |
|     |     |

## Itanium Register Set

| RZ/GR0 | GR16 |            |
|--------|------|------------|
| GR1    | GR17 | GR32-      |
| GR2    | GR18 | GR127      |
| GR3    | GR19 | Reg. Stack |
| GR4    | GR20 |            |
| GR5    | GR21 |            |
| GR6    | GR22 |            |
| GR7    | GR23 |            |
| GR8    | GR24 |            |
| GR9    | GR25 |            |
| GR10   | GR26 |            |
| GR11   | GR27 |            |
| GR12   | GR28 |            |
| GR13   | GR29 |            |
| GR14   | GR30 |            |
| GR15   | GR31 |            |
|        |      |            |

| FR0   | Pr0   |
|-------|-------|
| FR1   | Pr1   |
| FR2   | Pr2   |
| FR3   | Pr3   |
| FR4   | Pr4   |
| FR5   | Pr5   |
| FR6   | Pr6   |
| FR7   | Pr7   |
| FR8   | Pr8   |
| FR9   | Pr9   |
| FR10  | Pr10  |
| FR11  | Pr11  |
| FR12  | Pr12  |
| FR13  | Pr13  |
| FR14  | Pr14  |
| FR15  | Pr15  |
| FR127 | Pr127 |

IP

UM

IPR's

## x86 Register Se

| RAX | MMX0/FPR0 | XMMO  |
|-----|-----------|-------|
| RCX | MMX1/FPR1 | XMM1  |
| RDX | MMX2/FPR2 | XMM2  |
| RBX | MMX3/FPR3 | XMM3  |
| RSP | MMX4/FPR4 | XMM4  |
| RBP | MMX5/FPR5 | XMM5  |
| RSI | MMX6/FPR6 | XMM6  |
| RDI | MMX7/FPR7 | XMM7  |
| R8  |           | XMM8  |
| R9  |           | XMM9  |
| R10 |           | XMM10 |
| R11 |           | XMM11 |
| R12 |           | XMM12 |
| R13 |           | XMM13 |
| R14 |           | XMM15 |
| R15 |           | XMM16 |
|     |           |       |

|  | Т |
|--|---|
|  | L |
|  |   |

RIP RFLAGS

IPR's

## Dealing with fewer registers

C/Bliss

•No issues, compiler can take care of it

Assembly

Some juggling required

VAX Macro (w/Alpha extensions)

• Alpha Pseudoregisters (per mode)



# Instruction Encoding



# I or 2 bytes, 1 Opcode per Operation

☐ 1 byte Containing addressing mode and register Onumber, up to 4 bytes of displacement, immediate data, or address

**C** same N [Operand





### Alpha

## 

## 26 bits, encoding up to 3 registers, up to 21-bit displacement, 8bit literal value, up to 16-bit function specifier





### Itanium

## Syllable 41 bits

Opcode 4 bits

Operands 31 bits, typically 10-bit function and 3 registers

> Predicate 6 bits

















### Phew...

Compilers • LLVM

Debugger etc.

Open-source instruction decoders

Exception handling

Manual labor



## Memory Layout



### Memory Layout

|                     | VAX             | Alpha                     | Itanium               | <b>x86</b>     |
|---------------------|-----------------|---------------------------|-----------------------|----------------|
| Address size        | 32              | 64                        | 64                    | 64             |
| H/W page size       | 512             | 8K/64K/512K/4M            | 4K-4G                 | 4K/2M/1G       |
| Split VA Space      | -               | yes                       | yes*                  | yes            |
| Regions             | _               | -                         | 8                     | -              |
| PT Levels           | 2               | 3                         | 3*                    | 4              |
| PTE Cache           | -               | -                         | VHPT                  | PDE cache      |
| Virt. Addr. Size    | 32              | 48                        | 48*                   | 48             |
| Phys. Addr. Size    | 32              | 44                        | 50                    | 52 (48)        |
| Physical addressing | yes             | yes                       | yes                   | _              |
| Segmentation        | _               | -                         | _                     | yes (kind of)  |
| Prot. Bits in TLB   | 4 enc[KESU][RW] | 11 [KESU][RW],<br>FO[RWE] | 7 enc[KESU], enc[RWX] | 3 R/W, U/S, XD |

\* OS Implementation Dependent, figures given are for OpenVMS





## The Case for SWIS

### What is SWIS, and why is it needed?

### **OpenVMS Assumes Things...**

- VAX/VMS was designed in tandem with the VAX hardware architecture. Where desirable, hardware features were added to satisfy the OS' needs.  $\bullet$ A lot of OS code was written to make use of these hardware features.





### What are these Assumptions?

- 4 hardware privilege modes  $\bullet$
- Each with different page protections
- And with their own stack
- 32 Interrupt Priority Levels
- 16 for Hardware Interrupts
- 16 for Software Interrupts
- Software Interrupts are triggered immediately when IPL falls below the associated IPL
- The hardware provides atomic instructions for queue operations
- The hardware provides a set of architecturally defined Internal Processor Registers (IPRs)

• Asynchronous Software Trap (AST) associated with each mode, triggered immediately when IPL falls below ASTDEL (equally or less privileged mode)



### How does Alpha meet these Assumptions?

- Alpha is a very clean RISC Architecture
- But OpenVMS was definitely in the Alpha Architecture designers' minds The 4 modes OpenVMS needs are part of the basic Alpha architecture
- $\bullet$ lacksquare
- PALcode, code supplied by firmware that has more privileges than even kernel mode, and which is uninterruptible, provides the flexibility to implement OS specific features
- IPLs, Software Interrupts and ASTs are implemented through a combination of hardware support and PALcode
- Atomic queue instructions are provided by PALcode
- PALcode also provides the mapping from IPRs as expected by OpenVMS to the hardware implementation's IPRs



### So how about Itanium Hardware?

- was considered
- Offers the 4 modes OpenVMS needs
- lacksquareinterrupts only
- No compatible software interrupt mechanism or ASTs •
- No atomic queue instructions
- No OpenVMS-compatible IPRs



### Very different story, Itanium's design was finished before OpenVMS as an OS

The TPR (Task Priority Register) provides an IPL-like mechanism for hardware





### Hence, SWIS

- involved in mode changes.
- SWIS implements the software interrupt and AST support required by OpenVMS, using hardware support as available.
- Other code in the OS (with some special support from the SWIS code to ensure atomicity) provides atomic queue instructions
- A combination of code in SWIS and other code in the OS provides OpenVMScompatible IPRs
- SWIS makes the Itanium CPU look more like a VAX to the rest of the OS

#### SWIS (Software Interrupt Services) is a piece of low-level OS code that is



### **Bridge Function**

the features supported by the hardware



## SWIS bridges the gap between the assumptions made by the rest of the OS to



### SWIS on X86-64

- Because a similar mismatch exists between OpenVMS' assumptions and the hardware-provided features, SWIS will be ported to X86-64.
- Ported means mostly re-written here, as the provided features are very different between Itanium and X86-64.
- On X86-64, SWIS will have to do more, as the X86-64 architecture does not provide the 4 mode support OpenVMS needs.
- Because of this, SWIS on X86 will not only be active when transitioning from an inner mode to an outer mode, but also when transitioning from an outer mode to an inner mode.
- Also because of this, SWIS now needs to become involved in memory management (in a supporting role).
- There's good news too: the Itanium architecture has some features that are very complex to manage (think RSE), that are absent in X86-64.



### **Swis on X86-64**

### **OpenVMS Expects:**

- 4 Modes, different page protections, separate stacks
- 32 IPLs (16 h/w, 16 s/w)
- Software interrupts tied to IPLs
- Per-process, per-mode ASTs, delivered when below ASTDEL
- Atomic queue instructions
- VAX-like IPRs

#### X86-64 Offers:

- 2 rings, different page protections, separate stacks
- 14 hardware TPR's, mask off hardware interrupts in groups of 16
- Software interrupts unaffected by TPR's. No IPL's
- No AST-like concept at all
- No atomic queue instructions
- X86-64 IPRs





## Itanium vs X86-64

### Differences relevant to SWIS between Itanium and X86-64

### **System Services**

#### **Itanium**:

- EPC Instruction
- Can only be called within a protected promote page
- First instruction executed in kernel mode is the one following the EPC instruction
- Does not disable interrupts
- Need to consider register stacks

- **SYSCALL** Instruction
- Can be called anywhere
- First instruction executed in kernel mode is at a fixed, OS-determined (MSR) address
- Can be set up to atomically disable interrupts
- No register stack







### Interrupts and Exceptions

#### Itanium:

- IVT is code
- Does not switch stacks
- Interrupted state is stored in registers
- Need to consider register stack

- IDT contains descriptors that points at code
- Switches to kernel-mode stack if  $\bullet$ needed
- Interrupted state is stored on • kernel-mode stack
- No register stack





### Memory Management

#### **Itanium:**

- 4 modes
- OS needs to handle an exception on TLB miss
- Memory translations are at the OS' discretion (Translation Registers)
- Support for URKW and similar protections
- 8 2<sup>61</sup> byte regions in VA

- 2 useable rings (0 and 3)
- Processor walks page tables on TLB miss
- All memory translations are pagetable based
- No support for ring 3 read, ring 0 write
- No regions



## Finding Per-CPU SWIS Data Structure

#### **Itanium**:

- Mapped through a dedicated TR •
- At fixed VA (0xE...0)
- Mapping of VA to PA is different on • each processor, and TR is not touched on a context switch

- Mapped through the normal page tables
- At a different VA for each CPU
- Mapping of VA to PA is identical for each process, so it doesn't change on a context switch
- Can be found at address 0 in the %GS segment on each CPU
- %GS segment is loaded from segment descriptor 0x28 on each CPU
- Segment 0x28 base VA is different on each CPU





### Finding the Per-CPU SWIS Data Structure (2)

#### Itanium:

- Accessing a field at offset 0x80: movl r28 = 0xe0000...00801d8 r28 = [r28]
- Getting the address of the structure: movl r28 = 0xe0000...0000

- Accessing a field at offset 0x80: movq %gs:0x80, %rax
- Getting the address of the structure: rdgsbase %rax - or
  - movq %gs:0, %rax





## Porting SWIS



### The porting process, a birds-eye overview

### **Design Phase**

years full-time), in several phases:

- Basic design (not detailed enough to base implementation on) Detailed design for System Service dispatching Detailed design for Hardware Interrupt and Exception handling Detailed design for Software Interrupts and ASTs

- Detailed design for Processes and Kernel Threads

SWIS for X86-64 was designed over a period of 1.5 years (1 year part-time, 0.5)



### **Design Review Phase**

- Partial reviews as the design progressed
- In-depth 3-day review between myself and Burns Fisher
- This one turned up a design flaw that could have enabled unprivileged code to bring down the system
- Complete walk-through and review in one of our weekly X86-64 engineering meetings
- A lot of the content in this presentation is based on the slides I prepared for that walk-through





### **Implementation Phase**

Implementation started in May 2017, broken down into different parts:

- $\bullet$
- **Data Structure Definitions**  $\bullet$
- VAX/Alpha IPRs  $\bullet$
- Hardware Interrupts and Exceptions
- System Services
- Software Interrupts  $\bullet$
- ASIS
- Initialization <
- Processes and Scheduling lacksquare



## Quick and Dirty Exception Handling for early code that needs something





### **2 SYSTEM PRIMITIVES execlet builds**

- Compatibility build, works on any x86-64 CPU we support
- the following:
- Address Space Numbers (PCIDs) in TLB 1.
- **RDGSBASE** instruction 2.
- 3. (MMX, SSE, AVX)
- Highest Performance build targets Intel processors made after 2013 (Ivy) Bridge and beyond).

## Performance builds, optimized for CPUs that have support for one or more of

XSAVES/XRSTORS instructions for saving/restoring extended ("floating point") registers



## The Guts of SWIS

### A technical overview, with lots of details



Kernel Mode







### Mode "Components"

- Processor ring (0 for K, 3 for ESU)
- Stack pointer
- Address Space Number
- Page Table Base
- Current mode as recorded in the SWIS data structure
- A mode is "canonical" when all the above are in agreement
- SWIS should be the only code that ever sees non-canonical modes •
- We prototyped this on Itanium lacksquare



## **Basics of Mode Switching**

- Interrupt or SYSCALL instruction
- Switches CS and SS to ring 0
- Switches to the kernel-mode stack (interrupt only, not SYSCALL) 2.
- **Disables interrupts** 3.
- Get fully into kernel mode (ASN, PTBR, stack, DS, ES)  $\bullet$
- Going in? -> Build return frame on stack  $\bullet$
- Going out? -> Deliver SwInts and ASTs as needed  $\bullet$
- Get into destination mode (ASN, PTBR, stack, DS, ES)
- IRET or SYSRET instruction
- Switches CS and SS to ring 3
- Switches to the outer-mode stack (IRET only, not SYSRET) 2.
- Enables interrupts 3.



### Context Switching

- Need to save/restore 4 PTBRs instead of 1
- MMX/SSE/AVX state will get saved using the XSAVE(S)/XRSTOR(S) • instructions
- Need to deal with performance monitoring and debugging registers  $\bullet$
- the process/thread.

Will have a "never scheduled" flag for initial process/thread creation (Alpha uses canonical stack, Itanium uses the PFS register)  $\rightarrow$  On first scheduling, the stacks get zapped, and we go to the appropriate initialization routine for



### So how close were we?

Pretty close

### We missed a few things though

- Impact of not having a PROBE instruction
- Need to deal with emulated Alpha registers
- •Use of interrupt stacks
- Need to track where the kernel stack is to detect a KSNV

### And we made some improvements

- Lightweight system services
- •All SWIS variants in a single image







### To learn more please contact us: vmssoftware.com info@vmssoftware.com +1.978.451.0110

