What is MACfns?
MACfns is a large suite of APL functions (250) which uses
assembly language to achieve extraordinary speed and unmatched
precision. We will discuss some of the characteristics,
advantages, and complexities of using assembly language, with
code examples and illustrations of the development and usage of
MACfns.
What is Assembly Language?
Assembly language is a [truly] primitive computer language
which translates [barely] human-readable code directly into the
low-level CPU (central processing unit) instructions which the
computer executes, called machine code. Essentially, it is
one-to-one mapping of mnemonics into machine instructions, the
language of the hardware. Programming at any level beneath
assembly language requires a soldering iron (although a yet lower
level, called microcode, is accessible to the implementers of the
microprocessor chips themselves). Therefore, assembly language
is intimately connected with the machine upon which it executes.
The assembly language for an IBM(r) PowerPC(r) chip is completely
different from that for a z9(r) mainframe, an UltraSPARC(r) chip,
an Intel(r) Itanium(r) chip, or an Intel Pentium(r) processor.
Fortunately, Intel Corporation has diligently maintained a
strategy of backwards compatibility in what it calls Intel
Architecture (IA), which is the instruction set and underlying
logical structure for the x86 family of processors (which does
not include the Itanium). This family, started almost three
decades ago, is the broadest and the most commercially successful
microprocessor architecture ever invented. Starting with the
8086/8087/8088, it evolved through the 80286/80287, the
i386(r)/387, the i486(r), the Pentium and its several generations,
and continues today with the Core Duo(r) processors. It is an
amazing testament to Intel that most assembly language programs
written in 1978 for a 16-bit 8088 processor will execute
successfully in 2006 on a 64-bit Pentium Extreme Edition or
Core 2 Duo machine. (In fact, with minimal translation, 8-bit
assembly programs written for a 1973-release Intel 8080 will also
work.) Furthermore, other manufacturers, notably AMD(r), have
maintained close compatibility with IA, in some cases even
leading the way to new extensions.
For the balance of this paper, we will refer to "assembly
language" exclusively as that for the instruction set of IA, in
which MACfns is written. We will further base our discussion
only on the single-CPU chips on which all APL implementations
to date are based, and upon which we can comment knowledgeably.
A minor nomenclature issue: humans write "assembly code" in
"assembly language"; an "assembler" translates assembly code
into "machine code", which is what machines execute. Sometimes
assembly code is called "source code", "assembler code", or
confusingly, just "assembler" (meaning the human-authored code,
not the translator). The machine code is sometimes called
"object code", but even more confusingly, is also called
"assembler code"; we have been guilty of this imprecision in
MACfns documentation. However, in this paper we will use and
maintain the distinction between the terms assembly language,
assembly code, and machine code.
Assembly language has instructions for the basic arithmetic
and logical operations, data movement, comparisons, branching,
shifting, flag manipulation, low-level I/O, and miscellaneous
operations. A second set of assembly instructions manipulates
the floating point unit (FPU), which provides a richer set of
mathematical operations. And yet a third set, having several
subsets, provides access to various SIMD (single-instruction
multiple-data) capabilities of the latest generations of
processors. Intel has made tremendous strides in the underlying
implementation of these instructions over the years, with names
like pipelining, prefetching, superscalar execution, instruction
reordering, speculative execution, shadow registers, retirement
buffers, and branch prediction. While these optimizations
benefit the speed of programs, and often affect the optimal
selection and arrangement of assembly instructions in a program,
they do not affect the correctness of the program -- it executes
precisely the way it was programmed.
As in APL, instructions in assembly language are
interpretive. An instruction cannot safely execute without
having the results of prior instructions. Most instructions
execute in only one or two cycles, but some of the more
complicated can take 50 or more, and others can be executed in
multiples per cycle. (The cycle time of the machine is the
reciprocal of its frequency rating, measured in hertz; a 3 GHz
machine runs at three billion cycles per second.) While it was
once possible to predict the execution time of an assembly
language program from its emitted instructions and the speed of
the machine in which it executed, this is no longer true; one
must understand the optimizations underlying IA to be able to
even approximate the execution time. Furthermore, the
interactions of multiprocessing and multiple threads executing
concurrently introduces unpredictability, such as the untoward
flushing of memory caches.
The Difference Between Assembly Language and APL
For our purposes, however, we shall concentrate on what
differs between APL and assembly language, and why MACfns is
implemented in assembly language rather than in a higher-level
language such as C (in which the APL+Win(r) interpreter itself is
implemented). Consider the following, typed in desk calculator
mode:
⎕←A←121212+192947
314159
It seems trivial, but literally thousands of machine instructions
are executed for this calculation, taking a few microseconds.
The user is prompted and the machine goes into a wait state
awaiting input. Each character entered is translated and
recorded in a screen buffer; the text is rationalized (backspaces
and cursor movements resolved). Upon receipt of a carriage
return (Enter), storage is allocated for the statement; the line
is parsed and tokenized; the symbol table is updated; the digits
are converted into numbers via a series of multiplications and
additions, then classified as Boolean, integer, or floating point;
and (depending on the APL system) the line is evaluated
syntactically. Finally, APL is ready to execute the line.
Since the line has no leading ∇, ), or ], APL interprets it
as an APL statement. The constants are moved to an execution
buffer. The interpreter branches to an entry for the dyadic +
routine, which evaluates the rank, length, and type of its
arguments (checking for RANK, LENGTH, and DOMAIN errors),
allocates storage for the result (checking for WS FULL), and runs
a [one-iteration] loop which adds the two numbers (checking for
overflow and LIMIT error), storing the result in the temporary
memory entry. The A← assignment then, after assuring that A is
neither a function nor label, attaches the temporary memory entry
to A, freeing any storage previously associated with A. Finally,
the ⎕← assignment is evaluated by allocating storage for a
character vector, formatting the value via a series of divisions
and subtractions, outputting each digit to the screen or terminal,
and freeing the storage. Then APL outputs a new line and
six-space indent and returns to a wait state.
Whew! The exact sequence differs slightly in different APL
implementations, but you get the picture.
Now consider the following assembly code:
MOV EAX,121212
ADD EAX,192947
This moves the 32-bit constant 121212 into the register EAX (of
which there are only eight [16 in 64-bit processors]), then adds
another 32-bit constant to it. At the same time, it sets flags
indicating the sign of the result, the parity of its low 8 bits,
and whether it overflowed 4 bits (for BCD arithmetic), 31 bits
(signed doubleword arithmetic), or 32 bits (unsigned arithmetic).
That's all it does. There is no input or output; there are no
changes to memory. It also executes in less than a nanosecond.
Writing and Running Assembly Language in APL+Win
During Sykes Systems' development of MACfns, we have
developed tools to let us assemble and execute such instructions
on the fly (naturally, the diamond separates statements):
RUN 'MOV EAX,121212 ⋄ ADD EAX,192947'
10 bytes code:
EAX=314159 EBX=0 ECX=0 EDX=0
EBP=0 ESI=0 EDI=0 FLG=514
FLAGS: JA, JG [no flags]
This displays the seven general-purpose registers and the flags
register upon completion. These eight values are the explicit
result of ⎕CALL. The stack register ESP is not included, nor is
the instruction pointer EIP, segment registers, descriptor table
and status registers, nor other special-purpose registers.
Perhaps we want to examine the machine code itself, and then
run it separately:
⎕←MX←CODE 'MOV EAX,121212 ⋄ ADD EAX,192947'
+|Ù⍀¨∣⍙€
This gibberish can be clarified by using ∆AF from MACfns (atomic
function, here used as (⎕AV⍳MX)-⎕IO):
∆AF MX
184 124 217 1 0 5 179 241 2 0
256⊥0 1 217 124
121212
256⊥0 2 241 179
192947
2 HEX ∆AF MX ⍝ for the cognoscenti, or masochists
B8 7C D9 01 00 05 B3 F1 02 00
We see that the constants are indeed embedded in the code
(backwards, the much-debated "little endian" characteristic of
Intel Architecture). We can infer that 184 (B8) is the opcode
(operation or instruction code) for MOV EAX and 5 is the opcode
for ADD EAX (it's a little more complicated, however).
Now that we have assembled our teeny assembly code program
into machine code MX, let's try to run it using APL+Win's ⎕CALL,
which enables us to execute the machine code directly:
⎕CALL MX
DOMAIN ERROR
⎕CALL MX
^
We are missing two crucial elements. The first is that ⎕CALL
requires the first four bytes to be a system-dependent signature
value (it differs between APL+Win and APL+DOS) to help discourage
inappropriate use. Its absence is the cause of the DOMAIN ERROR.
The second is that after our snippet of code runs, it doesn't
know where to go; it fact it will scamper off into the workspace,
executing all manner of nonsense until, in a few nanoseconds, APL
tosses us out on the street (i.e., Windows(r)) for our
misbehavior.
The signature value is easily obtained from any MACfns
machine code (what we loosely refer to as assembler code in our
documentation):
↑⍙AF
2000042035
82 ⎕DR ↑⍙AF ⍝ show it as characters
386w
The final instruction executed must be a return-from-procedure
instruction (which pops the system stack and jumps to a memory
address which ⎕CALL initially placed there for a graceful exit).
Here is a proper assembly program for APL+Win (DD defines the
signature as a doubleword [four bytes], and RETN returns from a
near procedure, which characterizes all MACfns):
AC←'DD 2000042035 ⋄ MOV EAX,121212 ⋄ ADD EAX,192947 ⋄ RETN'
which we can assemble into machine code,
∆AF ⎕←MC←CODE AC
386w+|Ù⍀¨∣⍙€Ã
51 56 54 119 184 124 217 1 0 5 179 241 2 0 195
and execute,
⎕CALL MC ⍝ 121212+192947
314159 0 0 0 0 0 0 514
Voilà! Our first working assembly code program.
A key reason why MACfns uses Assembly Language
Herein lies a major advantage of assembly code and the
resultant machine code for MACfns: it is extremely lean. The
overhead of ⎕CALL MC is about the same as that for A+B when A and
B are integer scalars and the expression has already been parsed,
tokenized, and evaluated for syntax. There is no interpretive
overhead or analysis, storage allocation or movement of data
(except for the result of ⎕CALL itself), or interface to external
or asynchronous processes. Furthermore, for small programs
assembly code can be simpler and less verbose than C code.
If we are extremely careful, we can modify and run machine
code directly. (If we make a mistake, we could lose our APL
session when we execute it.) For example, before we noticed that
the embedded constant 121212 was encoded as ∆AF 124 217 1 0,
which is now the sixth-ninth bytes of MC. We can change the
constant in variable MC,
∆AF MC[6 7 8 9]
124 217 1 0
MC[6 7 8 9]←82 ⎕DR 7053 ⍝ or ∆AF⌽(4⍴256)⊤7053
and rerun the code,
⎕CALL MC ⍝ 7053+192947
200000 0 0 0 0 0 0 530
to see the effect. This is the basis of the mechanism MACfns
offers to customize the machine code for user-settable defaults
and other characteristics which extend its flexibility and
utility. As you might guess, the smallest machine code we can
run in APL+Win is
⎕CALL 2000042035 195 ⍝ or ⎕CALL '386wÃ'
0 0 0 0 0 0 0 530
which is solely the signature value and a RETN instruction, five
bytes in total.
Sample of MACfns Development
So far the code we've produced does nothing useful. Below
is a more substantial program with practical utility. While it
would never be released in MACfns (it's far too limited), it does
illustrate our style of writing assembly code. Observe that
assembly code uses the semicolon instead of a lamp for comments.
GET 'GCDI' ⍝ The code resides in an APL component file.
; ∇ Z←L ∆GCDI R
; [1] Z←↑(L,R)⎕CALL ⍙GCDI ∇
CPULEV EQU 0
; Greatest common divisor of integers in EAX and EBX saved to EAX.
; The result is positive unless both arguments are 0 or ¯2*31.
; For Boolean arguments, (L ∆GCDI¨R)≡L∨R
; Copyright Sykes Systems, Inc. 9Nov2006/Roy ⊂MACfns⊃
NEG EAX
JG SHORT POS
JZ SHORT ZER
NEG EAX
POS: NEG EBX ; 1≤EAX≤2147483647 or EAX=¯2147483648
JG SHORT LP
JZ SHORT XIT
NEG EBX
LP: XOR EDX,EDX ;⊤Euclidean algorithm calculates EAX=L GCD R
DIV EBX ;∣(EAX EDX)←0 EBX⊤EAX (ignore quotient EAX)
MOV EAX,EBX ;∣divisor is result or next dividend EAX
TEST EDX,EDX ;∣EDX is remainder ÷
MOV EBX,EDX ;∣remainder is next divisor if nonzero EBX
JNZ SHORT LP ;⊥12-byte/6-inst. loop
RETN ; exit
ZER: MOV EAX,EBX ; EAX was 0, so return ∣EBX
NEG EBX
JL SHORT XIT
XCHG EAX,EBX
XIT: RETN ; exit
Here is how we would assemble GCDI into code to be released
in MACfns. The machine code is stored in a global variable named
⍙GCDI, which is called by an APL function named ∆GDCI:
CPL 'GCDI' ⍝ Compile (assemble) code as global ⍙GCDI.
37 0 3 40
⍝ The result is some statistics about the size of the code.
⍙GCDI ⍝ the generated machine code
2000042035 75487479 ¯654895244 75488247 ¯604563852 ¯201862575
¯762985589 ¯193602933 ¯138179645 ¯1828619045 195 1229210439
538976288 2006110900
⎕←S←¯3↑⍙GCDI ⍝ its last three integers
1229210439 538976288 2006083100
(82 ⎕DR 2⍴S),(0,3⍴100)⊤↑⌽S ⍝ decipher the suffix
GCDI 2006 8 31 0
The assembly process automatically inserts the leading signature
value. It also appends the root name of the function, its
timestamp, and its required CPU level (here for any IA-32 chip,
such as an i386), all of which are documented standards of MACfns.
The trailing 195, the machine code for RETN, also happens to be
visible in ⍙GCDI; it usually is not.
⍙GDCI is an integer vector rather than a character vector.
As long as its right argument is simple and homogeneous (numeric
or character), ⎕CALL does not care about the rank, shape, or
datatype of what it executes; it simply points the machine (via
the instruction pointer) to the fifth byte of data (i.e., that
immediately following the signature value) and lets it run.
We chose the integer vector representation for MACfns because
Booleans are too sparse, characters display sloppily (as we saw
in MX and MC above) and are fragile (editing them can change them
unexpectedly), and floating point can contain values that are not
real numbers (such as infinities and NaN's, not-a-number) which
can freeze APL when displayed or be changed unexpectedly by APL
operations.
Below would be the APL cover function in MACfns. It has
standard naming and calling conventions, cryptic documentation
solely to assist memory (the full documentation is in a separate
file, and would be named dGCDI), and a distinctively-constructed
copyright notice on line [2] ('⍝∇' and 'Copyright' are separated
by ∆AF 8 255 (or (¯1↓⎕AV),⎕TCBS), invisible here but available to
code management systems).
∇ Z←L ∆GCDI R
[1] ⍝∇Greatest common divisor of two integer scalars.
[2] Z←↑(L,R)⎕CALL ⍙GCDI⍝∇Copyright 2006 Sykes Systems, Inc.
9Nov2006 ⊂MACfns⊃
∇
Atypically, the numeric arguments in ∆GCDI are presented directly
to ⎕CALL, and the result is [the first element of] that of ⎕CALL.
Since ⎕CALL only allows up to seven integers in its left argument,
and returns only eight, this method is extremely limiting.
Furthermore, there is no error checking beyond that which ⎕CALL
itself performs. Later we shall see how MACfns typically passes
and checks data.
All assembly-based functions in MACfns also have one or more
APL analog functions, which can be used for comparison, testing,
study, or in migrations to other APL systems. Here is what the
one for ∆GCDI might look like:
∇ Z←L aGCDI R
[1] ⍝∇Greatest common divisor of two integer scalars.
[2] ⍝∇Copyright 2006 Sykes Systems, Inc. 9Nov2006 ⊂MACfns⊃
[3] ⍝ The result is an integer scalar, positive
[4] ⍝ unless both arguments are 0 or ¯2147483648.
[5] ⍝ ⎕ERROR(2≠⎕NC'L')/'VALENCE ERROR'
[6] ⍝ ⎕ERROR(~1 1≡(⍴1/L),⍴1/R)/'RANK ERROR'
[7] ⍝ ⎕ERROR(323≠⎕DR 2,L,R)/'DOMAIN ERROR'
[8] L←↑L ⋄ Z←↑R
[9] :repeat
[10] R←Z
[11] :until 0=L←(Z←L)∣R
[12] :if Z≠¯2147483648 ⍝ ¯2*31
[13] Z←∣Z
[14] :endif
∇
We find they work identically, here used with each (¨):
L←149+⍳11
⊃L (L ∆GCDI¨180) (L aGCDI¨180)
150 151 152 153 154 155 156 157 158 159 160
30 1 4 9 2 5 12 1 2 3 20
30 1 4 9 2 5 12 1 2 3 20
but even in this small application ∆GCDI is about three times
faster than aGCDI. For less numerically tractable arguments,
such as
L←506750217 2140522190 2025047149 1220210790 1857457613
R←235578269 1766677554 1169568863 953647643 712205927
L ∆GCDI¨R
1 2 1 1 1
∆GCDI is over fifteen times faster. Actually, the machine code
is hundreds of times faster, but the ∆GCDI cover function and its
⎕CALL, catenate (L,R), and first (↑) take the vast majority of
the time here; the each (¨) masks the difference even more.
Other Reasons Why MACfns Uses Assembly Code
This brings us to a second compelling reason we use assembly
language: testing and iteration are hundreds to thousands of
times faster in machine code than in APL, and several times
faster than in compiled languages. Often no explicit tests at
all are needed, the flags set by preceding instructions providing
all the information needed. Special-case testing, for which the
cost in APL must be carefully weighed against the expected gains,
becomes far more feasible in assembly language. ∆UNTAB is
typical of a function with many useful special cases for which
the APL testing is far more expensive (to the point of not being
justified) than the testing in assembly language.
More individualized treatment of particular cases also
enables greater accuracy and speed. For example, our original
prototype assembly code for the hyperbolic sine (∆SINH or 5○R)
was only about twice as fast as APL and no more accurate. The
time was dominated by the mathematical computation ((*R)-*-R)÷2,
but careful numerical analysis disclosed a wealth of individual
optimizations for different ranges (almost two dozen in all).
Assembly language allows us to test scalar-by-scalar for all
these cases, and invoke much faster and more accurate code for
each. The result was that the minimum speedup increased to over
four times, with speedups of ten or more times not uncommon, and
rare cases exceeding 1,000 times.
Another powerful reason we use assembly language is that it
affords us direct access to the data structures of APL variables
and the internal representation of data. For example,
∆NELMe R computes the number of elements in each item of a nested
(↑¨⍴¨,¨R). It is up to 20-400 times faster than APL not only
because it can iterate vastly faster than each (¨), but also
because the information is already available internally in each
APL item. APL code must either ravel each item to obtain the
number of elements (thus potentially copying the entire array),
or perform expensive multiplications via ×/¨⍴¨R, which moreover
fails if R is empty. ∆NELMe also moves far less data (see below).
Similarly, ∆REVA (which reverses all axes of an array) can
be more than 100 times faster than APL on Boolean arrays because
it can simultaneously manipulate the data at the bit, byte (8
bits), word (16 bits), or doubleword (32 bits) level. Likewise,
∆F2LOG R (⌊2⍟∣R) does not actually compute any expensive
logarithms; instead, it directly analyzes the internal binary
exponent of floating point numbers to achieve its 25-150 times
speedup (you can do the same thing in APL, albeit more slowly,
by using ⎕DR).
A fourth reason to use assembly language is avoidance, or
anticipation, of data movement. Modern processors are
increasingly memory-bound, which means the CPU is idling awaiting
data to be fetched from main memory. The main architectural chip
optimizations are prefetching anticipated data and the
proliferation of caches -- primary, secondary, and even tertiary,
both on-chip and off-chip -- to stage that data closer to the CPU.
∆DIVID (divide increment by decrement, or (R+1)÷R-1) moves data
only once, whereas APL must move it thrice. The typical speed
ratio of 2.5-5 times essentially reflects this reduced memory
access more than any improvements we could make to division and
addition. A substantial part of the speedup to ∆NELMe is also
because large intermediate nested arrays (,¨ and ⍴¨) need not be
created and destroyed.
A final reason we use assembly language is that it enables
algorithms which would actually be more difficult to code, and
much more expensive to run, in other languages, including APL.
These tend to be high-iteration or bit-fiddling kinds of problems,
such as encoding and decoding data for transmission, data
encryption and compression, and one we have implemented in MACfns,
∆CRC (cyclic redundancy check). aCRC is not a trivial exercise
in APL, but in assembly language the fundamental operation uses
only four machine instructions per byte of data, and the
resulting machine code runs 1,000-4,000 times faster than APL.
There are other cases where assembly language has access to
instructions, facilities, and hardware which are unavailable in
other languages or APL.
Interpreter Support Services
∆GCDI uses the values of its arguments directly as the left
argument to ⎕CALL; they therefore must be integers. Only eight
functions in MACfns do so, including ∆CHKTS, which checks a
single ⎕TS-form date or timestamp T, conveniently a seven-element
integer vector. This method is extremely lean and fast; in fact
the overhead of calling the APL cover function alone is about
double the cost of invoking the ⎕CALL and executing the machine
code. Thus, while ∆CHKTS T is about 3-4 times faster than even
the most highly-optimized APL code (APL analog function aCHKTS1
[some functions in MACfns have multiple APL analog functions,
some coded for clarity or pedagogy and some for speed]),
expression ↑T ⎕CALL ⍙CHKTS is about 9-12 times faster, and most
of that time is due to the overhead of the ⎕CALL and first (↑).
However, this method of passing arguments and returning
results directly is generally too restrictive. The normal way
MACfns does so is indirectly, by referencing the ⎕STPTR of the
variable names:
⎕STPTR'R Z L' ⍝ differs between workspaces
12 4 9
These values are arbitrary but unchanging and unique symbolic
handles for each name in a workspace, and are independent of the
class, referent, or value of the name. (They used to be symbol
table pointers in APL+PC, hence the name ⎕STPTR.) APL2000(r)
supplies a set of internal subroutine calls with APL+Win, called
Interpreter Support Services, which enable assembly programs to
use these indirect pointers to inquire about, establish, modify,
and erase variables. Here is how MACfns uses them:
L←6 ⋄ R←⍳10 ⍝ the arguments for upper triangular matrix
(⎕STPTR'R Z L')⎕CALL ⍙UTRI ⍝ run the machine code
0 10 0 0 ¯4 37389060 37311392 583
Z ⍝ the result has been created:
1 2 3 4 5 6 7 8 9 10
0 2 3 4 5 6 7 8 9 10
0 0 3 4 5 6 7 8 9 10
0 0 0 4 5 6 7 8 9 10
0 0 0 0 5 6 7 8 9 10
0 0 0 0 0 6 7 8 9 10
The ⎕STPTR's of the names of the right argument, result, and left
argument are presented to ⎕CALL. ⎕CALL executes the machine code
in ⍙UTRI, passing the ⎕STPTR handles to it in registers EAX, EBX,
and ECX. The machine code calls APL2000's Interpreter Support
Services with these handles to reference the arguments L and R
and create the result Z, which it then fills in before returning
back to APL.
This is all rather clumsy, which is why MACfns has APL cover
functions to handle the interaction. Here is the one for ∆UTRI:
∇ Z←L ∆UTRI R
[1] →⍙rzl⎕CALL ⍙UTRI⍝∇L-row upper triangular mat from vec R:
R×[⎕IO+1](⍳L)∘.≤⍳⍴R
[2] ⎕ERROR'VALENCE ERROR'⍝∇Copyright 2006 Sykes Systems, Inc.
31Aug2006 ⊂MACfns⊃
[3] ⎕ERROR'DOMAIN ERROR'
[4] ⎕ERROR'RANK ERROR'
[5] ⎕ERROR'WS FULL'
[6] ⎕ERROR'NONCE ERROR'
[7] ⎕ERROR'LIMIT ERROR'
∇
As in ∆GCDI, we see standard naming and documentation conventions,
but here we encounter two new aspects. One is the use of ⍙rzl on
line [1] instead of ⎕STPTR'R Z L' as the left argument of ⎕CALL;
⍙rzl is a global variable which is preset (usually during
workspace initialization in ⎕LX) as
⍙rzl←⎕STPTR'R Z L
We use ⍙rzl because computing ⎕STPTR'R Z L' is fairly expensive,
since it must look up (and if necessary enter) the names in the
symbol table. In this case, executing ⎕STPTR'R Z L' takes about
as much time as executing ⍙rzl ⎕CALL ⍙UTRI, so using ⍙rzl instead
effectively increases the speed of ∆UTRI by about 50% in small
cases (the balance being the overhead of the ∆UTRI function call,
localizing its names, and the branch). To minimize the number of
⎕STPTR variables, the order of the ⎕STPTR's presented to MACfns
is always the same: right argument, result, {left argument,
{temporary local}}. Thus only a maximum of four are needed.
The other new aspect is the branch on line [1] and the error
signaling on lines [2] to [7]. Just before returning to APL,
⎕CALL loads a return code into the EAX register, which is the
first element of the result of ⎕CALL, to which APL branches,
normally to 0 (end the function). This lean mechanism is the
method MACfns uses to signal errors. MACfns also sometimes uses
the mechanism to branch to a line of APL code to complete a task,
or to deal with cases for which the machine code is not prepared.
A notable advantage of this method is that all possible errors
are listed visibly in the body of the APL cover function.
The Interpreter Support Services do have several weaknesses
which we try to circumvent as much as possible in MACfns. They
are relatively slow; they perform inadequate error checking; they
have undocumented behavior and return codes; they impose small
and arbitrary size limits; and they lack a number of useful
facilities. Nonetheless, on the whole they are adequate to our
needs for MACfns development.
The Disadvantages of Assembly Language
There are two main disadvantages to assembly language:
- It is cumbersome, hard to write, slow (to write), and fraught
with mystery.
APL programmers would be aghast at the primitiveness of
assembly language and its facilities. The following code
fragment is a loop [which is the only way] to add two arrays:
L22: MOV EAX,[ESI] ; get an item from L
ADD ESI,4
MOV EDX,[EBX] ; get an item from R
ADD EBX,4
ADD EAX,EDX ; add them
JO SHORT OVF ; jump if overflow
MOV [EDI],EAX ; store the result in Z
ADD EDI,4
DEC ECX
JNZ SHORT L22
We need ten instructions for each item in this 22-byte loop. Six
out of the seven general-purpose registers are used (only EBP is
available) -- heaven forbid we need do something more complicated
than addition. It only handles the case when both arrays and the
result are of integer datatype (⎕DR type 323), and of the same
size, which is but one of at least 14 simple cases which APL
gracefully accommodates (not to mention nested arrays). The code
does not show the machinations at label OVF needed to handle
integer overflow (blowup to floating point).
As does APL, assembly language offers many ways of doing
things. Below is another way of expressing the same loop:
L23: LODSD ; get an item from L
ADD EAX,[EBX] ; add it to R
JO SHORT OVF ; jump if overflow
ADD EBX,4
STOSD ; store the result in Z
LOOP L23
While this loop is only six instructions and 11 bytes, it still
uses five of seven registers (EDX is now also free). However, it
is also inexplicably (if one is not intimately familiar with the
specifications of modern processors) much slower. Both loops
would be prefaced in MACfns by about 200 handcrafted assembly
statements to initialize, check, and prepare everything, and the
Interpreter Support Services called doubtless have many more.
Finally, both loops shown above, while operational, have
deficiencies; the actual loop in MACfns would be more complex
(and faster).
Of course, it's not quite as bad as it first seems. We have
accumulated quite a set of tools and macros (bodies of drop-in
code) during the development of MACfns and our work for clients.
Nonetheless, new challenges arrive constantly. Furthermore, one
of the key advantage of MACfns is speed; simply-adequate working
code (like the loops above) is not sufficient.
A major mitigating factor in our development in assembly
language is that we use APL. We have many tools and utilities
written in APL (such as CODE, HEX, and RUN that we used above)
which assist in our development; MACfns itself is a vital
component. We use APL in algorithm design, modeling, timing,
testing, and verification. We use it to explore and verify
properties of the chips, workspace structure, and data
representations. We even use MACfns and APL to optimize our
assembly language, bootstrapping their capabilities to enhance
their capabilities.
- Assembly language is machine dependent.
Although MACfns is currently based on Intel Architecture,
its dominance lessens our concern about migration to other
platforms (no one has approached us yet). We provide APL analog
functions (often more than one) for every function in MACfns to
enable our clients to migrate to other APL systems and hardware
architectures.
By machine dependence we mean the evolution of the
microprocessors on which IA runs. Each new optimization Intel or
AMD introduces is one we must digest and consider incorporating
into MACfns; the timing differences in the loops above are an
illustration. Given limited resources, we can incorporate only
some. As we alluded to in the beginning of this paper, we have
not yet addressed hyperthreading or multi-core processors. The
APL+Win platform itself, stable for many years, may introduce
other changes with which we must cope; APLNext(r) is yet another
consideration.
Were MACfns written in a higher-level language, these
changes would be much simpler, some automatically performed by a
compiler. On the other hand, the utter reliability of MACfns (no
one has ever reported a bug), its high-precision accuracy, its
speed, and in some cases even its functionality, would be
compromised by depending on notoriously fickle compilers.
Limitations of MACfns
Because it is based in IA-32 assembly language, MACfns
runs only on 32-bit Intel Architecture machines, which includes
Intel processors such as the Pentium, Celeron(r), Xeon(r), and
Core Duo, and AMD processors such as the Athlon(r) and Opteron(r).
It also runs on 64-bit IA machines using the APL+Win 32-bit
interpreter. It does not run on RISC processors or mainframes.
Because MACfns uses APL+Win Interpreter Support Services,
and incorporates knowledge of the structure of variables in an
APL+Win workspace, it runs only under APL+Win (or APL+DOS). It
does not run under APLNext, APL+UNIX(r), APL2(r), APLX(r), or
Dyalog APL(r). However, MACfns has no direct dependencies on the
Windows operating system; thus (a slightly different, but fully
compatible) version of MACfns also runs in APL+DOS under the DOS
operating system.
Benefits of MACfns
MACfns has been developed over the last 19 years by Sykes
Systems, Inc., with its APL antecedents going back over 35 years.
It is written by and specifically for APL programmers, and is
thus easily understood by them. While the advantages of assembly
code are a major contributor to its extraordinary performance, it
is the thoughtful design, attention to detail, careful
programming, and literate documentation which truly distinguish
it. MACfns is in some ways a set of superior primitive functions
which improves both programmer and machine productivity. We can
summarize the benefits of MACfns as follows:
- Faster than APL+Win, typically from a few times to an order
of magnitude or more. Also typically 2-8 times faster, more
general, and handling larger arguments, than equivalent functions
in the 'ASMFNS' workspace distributed with APL+Win.
- Uses less storage by detecting identities, avoiding numeric
promotion, consolidating pointers, and using less (usually no)
intermediate storage. Avoids WS FULL and storage thrashing.
- Sometimes accepts larger arguments and produces larger array
results, and handles arrays of all rank and shape, avoiding
LIMIT ERROR's; all size limits are documented, including those
for corresponding APL expressions.
- Extended, and documented, domain and range for floating point
calculations, avoiding LIMIT ERROR's and improving accuracy.
- Greater accuracy via improved algorithms, use of extended
precision hardware, and careful attention to floating point
precision considerations, intermediate calculations, and rounding.
- Generality in the definition of the utility functions to
increase their usefulness in more situations.
- Extensive customization options, wherein documented locations
within the machine code may be changed for useful effect; the APL
cover functions may also be customized.
- Improved, consistent, and fully-documented error handling.
- Well-written APL analog functions to complement the APL cover
functions and machine code. These are useful in testing and
timing, during migration to other systems, and for studying APL
technique. Multiple techniques are often provided.
- Extensive and complete documentation, including identities,
limits, errors, related functions, and examples (430 pages in
all).
- Reliable; never unexpected performance, undocumented errors,
or catastrophic termination of the APL session. No bugs have
ever been reported, although we have uncovered some and have
reported them to our users.
Summary
We have described characteristics of assembly code, and how
and why we use it for MACfns. We encourage you to explore MACfns,
and to give us suggestions on future direction (the queue is
always growing, but the order changes). We thank those of you
who have purchased and are using MACfns. Your financial support
enables continuing improvement and new capabilities in the future.
Attached to this paper are several documents provided with
MACfns Release 3.0 (31Aug2006) which may be of general interest,
or which augment the information presented in this paper. The
papers in the APL2000 User Conference notebooks from 2002-2005
also provide particularly comprehensive overviews of MACfns.
-----------------------------------------------------------------
AMD, Athlon, and Opteron are trademarks of Advanced Micro Devices, Inc.
APLNext, APL2000, APL+UNIX, and APL+Win are trademarks of APLNow LLC.
Dyalog APL is a trademark of Dyalog Ltd.
Celeron, Core Duo, i386, i486, Intel, Itanium, Pentium, and Xeon are trademarks of Intel Corporation.
APL2, IBM, PowerPC, and z9 are trademarks of International Business Machines Corporation
APLX is a trademark of Micro APL Ltd.
Windows is a trademark of Microsoft Corporation.
SPARC is a trademark if Sparc International, Inc.
UNIX is a trademark of X/Open Company, Ltd.
-----------------------------------------------------------------
Sample APL Analog Function From MACfns
∇ Z←aCHKTS1 R;A;B
[1] ⍝∇Check ⎕TS-form timestamp.
[2] ⍝∇Copyright 2006 Sykes Systems, Inc. 31Aug2006 ⊂MACfns⊃
[3] ⍝ Limits are 1800 1 1 0 0 0 0 to 2200 12 31 23 59 59 999
[4] ⍝ (algorithm okay from 1600 to 3599).
[5] ⍝ aCHKTS1 implicitly checks and engenders errors (like ∆CHKTS).
[6] ⍝ aCHKTS explicitly checks and signals errors (unlike ∆CHKTS).
[7] ⍝ slowest→fastest: aCHKTS, aCHKTS1, ∆CHKTS, ↑R ⎕CALL ⍙CHKTS
[8] (R B A)←3⍴Z←7⍴R ⎕CALL'386wÃ' ⍝ check rank, length, domain; pad
[9] ⍝ absolute year limits 1600-3599:
[10] :if Z←Z≡1800 1 1 0 0 0 0⌈2200 12 31 23 59 59 999⌊Z ⍝ assure range
[11] :andif A>28 ⍝ of valid dates, done for 91.99%
[12] :andif A>0 31 28 31 30 31 30 31 31 30 31 30 31[⎕IO+B] ⍝ 7.94%
[13] Z←29 2 0≡A,B,=/×4 100 400∣R ⍝ .07%
[14] :endif
∇
-----------------------------------------------------------------
Examples of Documentation from MACfns
MACfns 31Aug2006
Z←∆AF R Atomic function
maps characters in ⎕AV to integers 0-255 or vice versa. The name
is derived from ⎕AF in IBM's APL2 product. ∆AF is useful in
avoiding clumsy constructions in documentation and code. For
example, instead of referring to the Euro character as ⎕AV[⎕IO+2]
(€), or setting ⎕IO and using ⎕AV[2] or ⎕AV[3], or explaining
elsewhere that ⎕IO has a particular value, ∆AF 2 always suffices.
If R is character, ∆AF is equivalent to (⎕AV⍳R)-⎕IO or
(⍴R)⍴↑82 323 ⎕DR R. If R is numeric, ∆AF is equivalent to
⎕AV[⎕IO+R] or (⍴R)⍴↑((⎕DR R),82)⎕DR R, so all items of R must
be integers 0-255 (else ∆AF signals an INDEX or DOMAIN ERROR).
The shape of the result is that of the argument, and its type
is integer or character. ∆AF is equivalent to [the APL coded,
not assembly coded] function AV from the 'ASMFNS' workspace
supplied by APL2000 with APL+Win version 3.6.02. ∆AF differs
from APL+Win version 3.6.02 in the following respects:
1. Integer tolerance for floating point is slightly different,
and more consistent, in ∆AF than in APL.
2. ∆AF needs space only for its result, but the equivalent APL
expressions need from 3% more to nine times the space.
3. ∆AF is up to 2.5-8 times faster than (⎕AV⍳R)-⎕IO for [almost]
any nonempty character array. ∆AF is up to 4-8 times faster
than ⎕AV[⎕IO+R] for 70 or more integers, 10-60 times for 50
or more Boolean items, and 10-25 times for three or more
floating point items. ∆AF is always faster than AV.
4. APL and AV can process up to 214,748,352 items. ∆AF R can
process more, depending on the datatype of R as shown in the
following table, which assumes R is a vector:
∆AF
Datatype of R Expression Alternative Maximum items
character (⎕AV⍳R)-⎕IO ↑82 323 ⎕DR R 268,000,126
Boolean ⎕AV[⎕IO+R] ↑ 11 82 ⎕DR R 1,072,000,504
integer ⎕AV[⎕IO+R] ↑323 82 ⎕DR R 429,496,719
floating point ⎕AV[⎕IO+R] ↑645 82 ⎕DR R 238,609,288
Errors:
DOMAIN ERROR The argument is nested or heterogeneous.
An item is from 0 to 255 but is not an integer.
INDEX ERROR An item is negative or exceeds 255.
LIMIT ERROR The result size would exceed 1,072,000,528 bytes
(see the table above).
Related functions: ∆CDR (change data representation),
∆XLATE (L[⎕AV⍳R]), ∆AVEPS (⎕AV∊R).
Examples:
∆AF 2 3⍴'MACfns'
77 65 67
102 110 115
∆AF 77 65 67 102 110 115 33
MACfns!
(Documentation for WS FULL, which is common to almost all MACfns
functions, is described elsewhere, and is not repeated in the
detailed documentation of each function. The documentation is
generally formatted for 60-line pages.)
-----------------------------------------------------------------
MACfns 31Aug2006
Z←∆SINH R Hyperbolic sine
is equivalent to 5○R, returning the hyperbolic sine of R.
The magnitude of R must not exceed 1025×⍟2 (about 710.47586).
The magnitude of the result is greater than or equal to that of
the argument, and the sign of the result is that of the argument.
The hyperbolic sine is the reciprocal of the hyperbolic cosecant
(∆RECIP ∆CSCH). ∆SINH is the odd component of the exponential
function. The even component is ∆COSH (6○R), so (for
(∣R)≤1024×⍟2, about 709.7827) (∆SINH R)+∆COSH R approximates
∆EPOW R (*R). ∆SINH is the inverse of ∆ASINH (¯5○R); that is,
R≡∆SINH ∆ASINH R and (for (∣R)≤1025×⍟2) R≡∆ASINH ∆SINH R. ∆SINH
differs from APL+Win version 3.6.02 in the following respects:
1. ∆SINH is slightly more accurate than 5○R, and always
preserves the identities (∆SINH R)=-∆SINH-R and (×∆SINH R)≡×R
and (∆SINH∣R)≥∣R (5○R sometimes does not). It also uses a
different algorithm for integers than for floating point;
the integer algorithm is faster and returns slightly more
accurate results.
2. 5○R returns 0 for values of magnitude less than 2*¯1022
(about 2.2251E¯308); ∆SINH returns these values.
3. 5○R signals a DOMAIN ERROR if the magnitude of any item
exceeds 1025×⍟2; ∆SINH signals a LIMIT ERROR.
4. If R is empty, 5○R returns an integer result; otherwise,
it returns a floating point result. If R is empty or Boolean
and all 0's, ∆SINH returns Boolean 0's; otherwise, ∆SINH also
returns a floating point result.
5. ∆SINH is faster than APL for 6-10 or more items, up to 4-12
times for floating point, and 10-16 times for integer or
Boolean. It makes no copy (and uses no space) if R is all
Boolean 0's, and so can exceed 1,000 times faster.
6. APL can process up to 134,217,723 floating point items,
178,956,964 integers, or 214,748,352 Boolean items. ∆SINH
can process up to 134,000,000 items (unlimited if R is all
Boolean 0's). APL can process nested arrays; ∆SINH cannot.
Errors:
DOMAIN ERROR The argument is character and not empty.
LIMIT ERROR The argument has more than 134,000,000 items and
is not all Boolean 0's.
An item exceeds magnitude 710.47586007394386.
NONCE ERROR The argument is nested or heterogeneous.
Related functions: ∆CSCH (÷5○R), ∆COSH (6○R), ∆TANH (7○R),
∆ASINH (¯5○R), ∆ACSCH (¯5○÷R), ∆EPOW (*R),
∆GD (¯3○5○R).
Examples:
A←¯1 0 .881373587 1 1.443635475 2
⊃A (∆SINH A)
¯1 0 0.881373587 1 1.443635475 2
¯1.175201194 0 1 1.175201194 2 3.626860408
R←3 3.951613336 3.989326806 4 4.025670416
⊃R (∆SINH R)
3 3.951613336 3.989326806 4 4.025670416
10.01787493 26 27 27.2899172 28
Z←(∆HALF-⌿∆EPOW 1 ¯1∘.×R) (∆HALF ∆SUBR ∆EPOW R) (-∆SINH-R)
D←¯1+2×⍳85 ⍝ odd coefficients 1 3 5 ... 169
Z←Z,((R∘.*D)+.÷!D) ((∆EPOW R)-∆COSH R) (∆RECIP ∆CSCH R)
Z∊⊂∆SINH R ⍝ identities
1 1 1 1 1 1
G←.5×1+5*.5 ⍝ Golden ratio
H←.2 .5,G,2 5 ⋄ ⊃H (⍟H) (∆SINH ⍟H)
0.2 0.5 1.618033989 2 5
¯1.609437912 ¯0.6931471806 0.4812118251 0.6931471806 1.609437912
¯2.4 ¯0.75 0.5 0.75 2.4
-----------------------------------------------------------------
MACfns 31Aug2006
Z←L ∆UTRI R Upper triangular matrix
creates an upper triangular matrix having L rows and ⍴R columns.
L must be a nonnegative integer scalar or one-item vector, and R
must be a simple numeric (or empty character) vector. When R is
Boolean and all 1's, then ∆UTRI is equivalent to the following
expression, which generates an upper triangular Boolean histogram:
(⍳L)∘.≤⍳⍴R
However, in general ∆UTRI multiplies each row of this histogram
by its vector right argument, and is equivalent to the following:
R×[⎕IO+1](⍳L)∘.≤⍳⍴R
or
((⍳L)∘.≤⍳⍴R)×(L,⍴R)⍴R
The shape of the matrix result is L,⍴R. The type of the result
is Boolean if empty, that of the right argument otherwise (APL
returns empty results as integer).
If the result is not empty, ∆UTRI is always faster than the
faster (which can differ) of the two expressions above in APL+Win
version 3.6.02. L ∆UTRI R⍴1 is faster than (⍳L)∘.≤⍳⍴R if the
result has 75 or more items (but usually less). Speedups vary
widely, ranging from 5-7 times faster for 500-item results up to
several hundred times for large Boolean results, and are greatest
for Boolean right arguments and smallest for floating point.
When L exceeds ⍴R, speedups grow as L increases relative to ⍴R.
∆UTRI needs space only for its result; APL needs 2-33 times
the space. APL can return up to 214,748,352 items; ∆UTRI can
return up to 134,000,000 items. APL can process nested arrays;
∆UTRI cannot.
Errors:
DOMAIN ERROR The left argument is not a nonnegative integer,
or exceeds 2,147,483,647.
The right argument is character and not empty.
LIMIT ERROR The result would contain more than 134,000,000
items (134E6<L×⍴R).
NONCE ERROR The right argument is nested or simple
heterogeneous.
RANK ERROR The left argument is not a scalar or one-item
vector.
The right argument is not a vector.
VALENCE ERROR No left argument is supplied.
Related functions: ∆opAND (L∘.^R).
Examples:
5 ∆UTRI¨(5⍴1) (9 .2 30),⍳¨5 10
1 1 1 1 1 9 0.2 30 1 2 3 4 5 1 2 3 4 5 6 7 8 9 10
0 1 1 1 1 0 0.2 30 0 2 3 4 5 0 2 3 4 5 6 7 8 9 10
0 0 1 1 1 0 0 30 0 0 3 4 5 0 0 3 4 5 6 7 8 9 10
0 0 0 1 1 0 0 0 0 0 0 4 5 0 0 0 4 5 6 7 8 9 10
0 0 0 0 1 0 0 0 0 0 0 0 5 0 0 0 0 5 6 7 8 9 10
-----------------------------------------------------------------
MACfns 31Aug2006
ASMFNS
Several MACfns are replacements for functions contained in
the 'ASMFNS' workspace supplied by APL2000 with APL+Win version
3.6.02. All are faster and handle larger arguments and return
larger results; several offer other advantages such as greater
generality or customizability.
The following table lists the functions from the 'ASMFNS'
workspace which have direct MACfns replacements, and summarizes
the differences. See the documentation in the individual MACfns
detailed descriptions for more details. Those noted as "(APL)"
are APL coded, not assembly coded, functions; these were at one
time written in assembly language (either for APL+PC or APL+DOS),
but were never rewritten for APL+Win.
The numbers in the column entitled "Typical Speedup Ratio"
represent the cpu time used by the 'ASMFNS' function divided by
the cpu time used by the MACfns function in a variety of common
cases. These ratios tend to be conservative. The numbers in the
column entitled "Maximum Result Size Ratio" are the maximum ⎕SIZE
of the result of the MACfns function divided by the maximum ⎕SIZE
of the result of the 'ASMFNS' function. All ratios are rounded.
In both cases, higher is better (MACfns being more advantageous).
Maximum
Workspace Typical Result
'ASMFNS' MACfns Speedup Size
Name Name Ratio Ratio Other benefits/differences
--------- --------- ------- ------ --------------------------
AV (APL) ∆AF 2.5-8 1.1-5 uses less space
DTBR ∆DTBR 5-25 5 customizable fill,
(APL) identity detection
INDEX1 ∆INDEX_ 1.5-10 .5-10 uses less space
(APL)
LJUSTIFY ∆LJUST 10-30 5 customizable fill,
(APL) identity detection
LOWERCASE ∆LCASEDOS 2-3 2.5 (none)
MATtoSS ∆MATSS 2-3 8 specifiable (or no) fill,
customizable defaults
RJUSTIFY ∆RJUST 10-30 5 customizable fill,
(APL) identity detection
⍙RPL ∆TXTRPL 3-8 5 uses less space
(APL)
SSLEN ∆SSLENS 4-8 1.25 uses less space
(APL)
SSSHAPE ∆SSSHAPE 4-6 (same) uses less space
(APL)
SStoMAT ∆SSMAT 1.5-2.5 8 specifiable fill,
customizable options
TEXTREPL ∆TXTRPL .2-4 8 result sometimes differs
TRANSLATE ∆XLATE 1.5-3 2.5 (none, but ∆XLATE requires
L to have 256 elements)
:if L
⍙UCASEX ∆UCASEDOS 2-3 2.5 (none)
UPPERCASE ∆UCASEDOS
WHERE ∆INDS 1.3-6 1.25 handles all datatypes
WHERE R∊0 ∆ZNDS
WORDREPL ∆WRDRPL 5-20 5 uses less space, result
(APL) often differs
The following functions in 'ASMFNS' currently have no MACfns
equivalent:
APL coded: DLB DLTB DTB NBLENGTH SSASSIGN
SSCAT SSCOMPRESS SSDEB SSDLB SSDLTB
SSDROP SSDTB SSFIND SSINDEX SSTAKE
SSUNIQUE TELPRINT
assembler-based: DEB DIV OVER ROWFIND ∆∆VR
We welcome suggestions as to which of these (or others) would be
useful to you.