Comparing Fast Implementations of Bit Permutation Instructions

comparing.pdf. It's a snapshot of the page taken as our search engine crawled the Web.
The web site itself may have changed. You can check the current page or check for previous versions at the Internet Archive. Yahoo! is not affiliated with the authors of this page or responsible for its content.
Comparing Fast Implementations of Bit Permutation Instructions Abstract - Recently, a number of candidate instructions have been
proposed to efficiently compute arbitrary bit permutations.
Among these, GRP is the most attractive, having utility for other
applications in addition to permutation such as sorting and hav-
ing good inherent cryptographic properties. However, the cur-
rent implementation of GRP is the slowest of the candidates;
BFLY, on the other hand, is the fastest. In this paper, we exam-
ine the possibility of executing GRP on a butterfly or an inverse
butterfly network.

I. I
NTRODUCTION



Bit permutation operations are very useful in the design of
block ciphers. However, current microprocessors do not di-
rectly support arbitrary permutations and thus such operations
are slow when implemented with software instructions [1].
Recently, a number of candidate instructions have been pro-
posed in order to efficiently compute arbitrary permutations
on a general purpose microprocessor. These include BFLY,
IBFLY [2, 3], CROSS [1, 4], OMFLIP [1, 5], PPERM [1],
SWPERM with SIEVE [6] and GRP [1, 7].
The fastest proposed permutation instruction is the
BFLY/IBFLY pair. These instructions route the data bits
through complete butterfly and inverse butterfly networks,
respectively. BFLY and IBFLY are the fastest in two senses.
First, only a single BFLY instruction followed by an IBFLY
instruction is required to compute any arbitrary permutation.
Thus, any one of the n! possible permutations of n bits can be
performed in two instructions using BFLY and IBFLY; the
other permutation instructions require a sequence of lg(n) in-
structions to achieve any arbitrary permutation.

More importantly, the BFLY and IBFLY instructions
have simple circuit implementations that exhibit a lower la-
tency than CROSS, OMFLIP or GRP [2, 8, 9]. This latency is
less than that of an ALU of the same width. Since a proces-
sors cycle time is usually determined by the ALU latency,
each of the BFLY and IBFLY instructions can be accom-
plished in one cycle.
GRP, on the other hand, has the longest latency of these
proposed bit permutation instructions. GRP divides its data
bits into two subsets depending on the corresponding control
bits: if a control bit is 1, that data bit is grouped right (an R
bit); if a control bit is 0, that data bit is grouped left (an L bit)
(Fig. 1). The relative ordering within each subgroup is main-
tained. GRP is slower than BFLY or IBFLY. It requires up to
lg(n) instructions to compute an arbitrary permutation. Fur-
thermore, the current GRP implementation utilizes a series of
linear shift network that has a much greater latency than that
of BFLY or IBFLY, taking two to three cycles to compute
(see section V for a detailed discussion of original GRP cir-
cuit) [8, 9].
Comparing Fast Implementations of
Bit Permutation Instructions

Yedidya Hilewitz
1
, Zhijie Jerry Shi
2
and Ruby B. Lee
1

Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA, {hilewitz, rblee}@princeton.edu
However, there is an impediment to using BFLY and IB-
FLY instructions the need to supply n</i>lg(n)/2 control bits in
addition to the n bits to be permuted for each of these instruc-
tions. Thus, for n=64 in a 64-bit microprocessor, four 64-bit
source registers are required for BFLY or IBFLY, while the
typical processor architecture supports only two source oper-
ands per instruction. On the other hand, GRP has additional
desirable features aside from its use in performing arbitrary
permutations GRP can be used to perform hardware radix
sorting [10] and has strong inherent differential cryptographic
properties [11].
Consequently, we examine whether the GRP operation
can be implemented on the significantly faster butterfly or
inverse butterfly networks. While GRP would still require
lg(n) instructions to compute an arbitrary permutation, a faster
implementation would make GRP much more attractive.
We show that GRP cannot be performed on a butterfly or
inverse butterfly network but that two inverse butterfly net-
works may be used to group the R bits and L bits in parallel.
The outputs of the two networks are merged to complete the
GRP operation. We show the circuit that dynamically decodes
the n GRP control bits to the n</i>lg(n)/2 IBFLY control bits.
This circuit has significant latency and offsets the speed of the
faster inverse butterfly network. However, the new design is a
viable alternative to the original GRP circuit. In addition, a
fixed GRP operation can bypass the decoder and directly use
the fast IBFLY network; the decoding can be done statically
by the compiler.
The paper is organized as follows: Section II examines
the possibility of performing GRP on a butterfly or inverse
1
This work is supported in part by NSF ITR 0326372; Yedidya Hilewitz
holds a Hertz Foundation/Princeton Fellowship and an NSF Fellowship.
2
Z. J. Shi is now with the Computer Science and Engineering Department,
University of Connecticut, Storrs, CT 06269 USA, zshi@engr.uconn.edu
Fig. 1: GRP operation on 8 bits.
bfh
are L bits;
acdeg
are R bits. Instruc-
tion is
GRP Rs,Rc,Rd
, where Rs is the source register, Rc the control
register and Rd the destination register.
Yedidya Hilewitz, Zhijie Jerry Shi, and Ruby B. Lee, Comparing Fast Implementations of Bit Permutation Instructions, Proceedings of the 38
th
Annual Asilomar
Conference on Signals, Systems, and Computers, November 2004. butterfly network. Section III discusses the n:n</i>lg(n)/2 decoder
and specifies the decoder circuit. Section IV analyzes the
critical path of the circuit using the method of logical effort.
Section V compares the new GRP implementation to the
original one. Section VI concludes the paper.

II.

A
NALYSIS OF
GRP
ON
BFLY
AND
IBFLY

N
ETWORKS



We define two new operations: GRPR selects the R bits
moving them to the right of the result register and zeros the L
bits. GRPL selects the L bits moving them to the left of the
result register and zeros the R bits. Then, the GRP operation
can be considered as the combination of GRPR and GRPL.
We believe that GRPR may itself be a useful instruction that
functions as a generalized EXTRACT, suited to gathering and
right justifying an offset or table index that may have bits scat-
tered across a word.

GRPR and GRPL are similar to the packing operation
described in the context of packet routing networks [12]. The
packing operation can be performed on an inverse butterfly
network, but not on a butterfly network. Furthermore, the
GRP operation involving the simultaneous packing into two
groups is not possible with one pass through either an inverse
butterfly network or butterfly network. We propose replicat-
ing the inverse butterfly network, performing GRPR and
GRPL in parallel and combining the results (Fig. 2), similar to
the approach with linear shift networks taken for the original
GRP circuit [8, 9].
We can show that GRP cannot be performed on a butter-
fly or inverse butterfly circuit. Also, GRPR or GRPL cannot
be performed on butterfly due to path conflicts (contention for
a multiplexer node or wire in Fig. 3). Butterfly and inverse
butterfly networks are composed of a number of stages, where
at each stage two bits in a pair of bits are either swapped or
passed through to the next stage. In this paper, a control bit of
0 indicates swapping and 1 indicates passing through for
each pair of bits. Each successive stage is composed of two
disjoint subnetworks, each subnetwork a butterfly or inverse
butterfly network that is half the size. As these subnetworks
are disjoint, there exists a unique path from any input to any
output. If two inputs are to be simultaneously routable, the
unique paths to their respective outputs must be through dif-
ferent subnetworks; otherwise, a path conflict will exist.
To show GRP cannot be achieved on the butterfly net-
work, consider the case shown in Fig. 3. Bits d and h are the
only bits in the R group and are destined for positions 1 and 0,
respectively. These bits are paired in the first stage and the
paths to their outputs are both through the lower subnetwork.
Thus they are not routable conflict-free. For the inverse but-
terfly network, consider the case shown in Fig. 4. Bit c is des-
tined for the even subnetwork (bit position 6 in result). Bit d
is also destined for the even subnetwork (bit position 2 in re-
sult). As they must both be routed to the even subnetwork
after stage 1, there is a path conflict.
To show GRPR also cannot be achieved on a butterfly
circuit, we can just use the same counterexample as for GRP
(Fig. 3).



Fig. 2: Overview of GRP o