How an Accident of Hardware Design Encouraged Open Source (O'ReillyNet)
[Posted February 23, 2007 by ris]
O'ReillyNet delves into computing history. "Back in the early 1970s, the hardware engineers at Digital Equipment Corporation made a decision about how their new computer, the PDP-11, would address memory. I believe their decision had the unintended, butterfly-effect consequence of helping to bring the open source software movement into existence."
How Open Formats Encouraged Open Source (O'ReillyNet)
Posted Feb 23, 2007 18:56 UTC (Fri) by khim (subscriber, #9252)
Interesting article. But while it's true that "Storing Data in Binary Fosters Closed Proprietary Systems; ASCII Fosters Openness" you should never forget that it's just a rule of thumb. It all boils down to the people in the end: people who prefer openness usually tend to use ASCII (nowadays UTF-8), people who like to invent lock-in schemes prefer to use binary, yet the use of ASCII (nowadays UTF-8) or binary does not guarantee anything. You can have a perfectly open and portable binary format (for example, TeX's DeVice Independent format) and you can have an unreadable text-based format (Open XML is the famous example).
Oh, and if you think cpio is dead, try asking someone more knowledgeable about the RPM format some day :)
Exaggerated.
Posted Feb 23, 2007 18:57 UTC (Fri) by AJWM (guest, #15888)
The article makes a big deal out of the byte ordering differences between IBM's 360/370 series and DEC's PDP-11 series (both of which used an 8-bit byte).
It had nothing to do with byte order. It had everything to do with the fact that the IBM used EBCDIC as its native character set, which among its other faults, is non-contiguous in its alphabetic sequences. That's what makes text manipulation such a pain in the butt on those systems. The article doesn't even mention EBCDIC.
(Mind, that's more an OS limitation than hardware -- Amdahl's Unix V7 for the 370 architecture, UTS, used ASCII, and porting apps from a PDP-11 to a 370 running UTS was trivial.)
A contemporary major line of systems, Burroughs', used a 48-bit word and strings of either 6-bit characters (common on many architectures, hence the lack of lower case letters) or 8-bit, the latter either EBCDIC or ASCII. The systems had dedicated string processing hardware; the segment descriptor that pointed to the string specified the character set.
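The non-contiguous alphabet mentioned above is easy to see for yourself. What follows is only a minimal sketch: it assumes Python's built-in `cp037` codec (one common EBCDIC code page) is a fair stand-in for the IBM character set, and the helper names `gaps` and `naive_is_lower` are made up for illustration.

```python
import string

def gaps(values):
    """Return (previous, next) pairs where consecutive codes are not adjacent."""
    return [(a, b) for a, b in zip(values, values[1:]) if b != a + 1]

# Byte values of 'a'..'z' in each character set.
ascii_lower = list(string.ascii_lowercase.encode("ascii"))
ebcdic_lower = list(string.ascii_lowercase.encode("cp037"))

def naive_is_lower(byte, encoding):
    """The classic range test: correct in ASCII, too generous in EBCDIC."""
    lo = "a".encode(encoding)[0]
    hi = "z".encode(encoding)[0]
    return lo <= byte <= hi

if __name__ == "__main__":
    print("ASCII gaps: ", gaps(ascii_lower))
    print("EBCDIC gaps:", [(hex(a), hex(b)) for a, b in gaps(ebcdic_lower)])
    # A code that falls inside one of the EBCDIC gaps passes the range test
    # even though it is not a letter.
    stray = 0x8A
    print(naive_is_lower(stray, "cp037"), repr(bytes([stray]).decode("cp037")))
```

The range test `lo <= byte <= hi` is the sort of idiom that works on an ASCII machine and quietly misbehaves when the same source is compiled for an EBCDIC one.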
Exaggerated.
Posted Feb 25, 2007 12:09 UTC (Sun) by eru (subscriber, #2753)
> It had everything to do with the fact that the IBM used EBCDIC as its native character set, which among its other faults, is non-contiguous in its alphabetic sequences. That's what makes text manipulation such a pain in the butt on those systems.
An anglo-centric view. For most other languages, alphabetic sequences are non-contiguous in all widely used character sets.
Exaggerated.
Posted Feb 26, 2007 23:17 UTC (Mon) by jzbiciak (✭ supporter ✭, #5246)
Well, perhaps today, but put it in context. In the relevant era (1970s in this case), when you're comparing EBCDIC vs. ASCII, character classification is quite a bit easier in ASCII than in EBCDIC. EBCDIC might not have been so bad to work with on BCD-centric hardware, but on machines that only focused on 2s complement, I can see it causing some heartburn in some cases.
In the modern day, obviously the debate's moot in the presence of Unicode, UTF-8, etc., and plenty of MIPS, RAM and disk to go around (as compared to the 1970s)....
Exaggerated.
Posted Feb 26, 2007 18:19 UTC (Mon) by MBR (guest, #43632)
I think you're misunderstanding the relevance of the IBM 360/370 to my point.
In the early 1980s, a number of startups wanting to build the next generation of desktop computers looked to Unix as an alternative to writing their own operating system from scratch because it was written in C rather than assembler. Many of them (e.g. Sun Microsystems) were using CPUs like the Motorola 68000 whose byte numbering scheme was exactly the same as that used by the IBM 360. Whether the Motorola engineers who designed the 68000 had copied this from the 360 or copied it from someone who copied it from someone who copied it from the 360 is impossible to say. This generation of computers all used ASCII. No-one in their right minds would have used EBCDIC, except for IBM who was stuck trying to maintain backward compatibility with their earlier offerings.
The PDP-11 is relevant because Unix had been rewritten from assembler into C on a PDP-11 in 1973. It had ported easily from DEC's PDP-11 to DEC's VAX because DEC had designed the VAX to be a grown-up PDP-11. The VAX followed the PDP-11's byte numbering scheme. By the early 1980s Unix was mature enough that a startup could consider using it for their OS.
The result was that programmers at countless Silicon Valley startups in the early 1980s found themselves constantly struggling to eliminate byte-order dependencies in C code as they ported Unix from one of the two standard VAX implementations, the Berkeley distribution or the Bell Labs distribution, to the new microcomputers. In comparison to the hordes of programmers in Silicon Valley, around Boston's Route 128, and elsewhere, who were porting code from ASCII-based little-endian machines to ASCII-based big-endian machines, the number of programmers porting code between EBCDIC-based IBM machines and these new machines was small. And so the former group had a much greater influence on common programming practice in the Unix world than the latter group.
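To make the kind of byte-order dependency described above concrete, here is a small sketch using Python's standard `struct` module. The function names are hypothetical and the example is far simpler than real Unix porting work; it only shows how "take the first byte" silently changes meaning between a little-endian and a big-endian layout.

```python
import struct

VALUE = 0x01020304

# The same 32-bit integer laid out both ways.
little = struct.pack("<I", VALUE)   # low-order byte at the lowest address
big = struct.pack(">I", VALUE)      # high-order byte at the lowest address

def low_byte_dependent(raw):
    """Byte-order-dependent: assumes the low-order byte is stored first."""
    return raw[0]

def low_byte_portable(raw, byteorder):
    """Portable: decode with an explicit byte order, then mask arithmetically."""
    return int.from_bytes(raw, byteorder) & 0xFF

if __name__ == "__main__":
    print(little.hex(), big.hex())
    print(low_byte_dependent(little), low_byte_dependent(big))
    print(low_byte_portable(little, "little"), low_byte_portable(big, "big"))
```

The dependent version is the Python analogue of casting an `int *` to a `char *` in C and reading through it; the portable version only ever talks about the value, not about where its bytes happen to live.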
Source available.
Posted Feb 23, 2007 19:11 UTC (Fri) by AJWM (guest, #15888)
Another point, going back to the mainframe and PDP-11 era, is that most software was distributed as source or at least "source available". Mainframes especially might feature 3rd party add-ons and configuration differences that would require apps to be compiled on the hardware they'd run on.
The first "open source" program I ever encountered was a "star trek" game (great-granddaddy of netrek, perhaps), written in Algol for the Burroughs B6700 and similar. It had changed hands (university computer centers, mostly) several times before I saw it; this was circa 1975.
Even into the early '80s I was supporting commercial mainframe packages that were distributed with source to allow local customization. Not truly open source perhaps, but (IMHO) it was a reaction to the _closing_ of that by Microsoft and others that helped stimulate the Open Source movement.
Source available.
Posted Feb 26, 2007 18:43 UTC (Mon) by MBR (guest, #43632)
Very specifically, what helped stimulate the Free Software movement (which was rebranded the "Open Source" movement in the late 1990s) was Richard Stallman's frustration with the way corporations were trying to lock up everything in sight. In the late 1970s, he used to rant to me and anyone else who'd listen, about how "they're taking away our freedom to program." According to "Free as in Freedom," Sam Williams' biography of Stallman, one of the major culprits was Xerox. Unlike most of us programmers who were frustrated with how corporations were making it impossible for us to share our code, Stallman had the realization that he could stop this by distributing useful code under a license that required recipients of the code to behave the way programmers had been used to behaving throughout the 1970s.
How Open Formats Encouraged Open Source (O'ReillyNet)
Posted Feb 23, 2007 21:52 UTC (Fri) by accensi (guest, #11754)
The author makes a mess of different concepts. If I remember well, strings were stored in memory in the same order, from left to right, on both the IBM 3x0 and the PDP/VAX. Binary data was different: this is the famous old distinction between the "little endian" and "big endian" ways of ordering the bytes, from the left or from the right, when storing a word from a register into memory. See http://en.wikipedia.org/wiki/Endian
Another important difference was the basic unit of I/O, the record: of fixed size in the IBM world (normally a multiple of 80 bytes, the size of an IBM/Hollerith punched card), and of variable size in the DEC world, perhaps because of its use of paper tape.
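The fixed versus variable record distinction can be sketched roughly as follows. This is an illustration only: the 80-byte size comes from the comment above, while the delimiter-terminated layout is an assumption chosen for simplicity and is not meant to describe any particular DEC file format.

```python
RECORD_SIZE = 80  # one punched card's worth of characters

def split_fixed(data, size=RECORD_SIZE):
    """Fixed-length records: boundaries are implied by position alone."""
    if len(data) % size:
        raise ValueError("data is not a whole number of records")
    return [data[i:i + size] for i in range(0, len(data), size)]

def split_delimited(data, delimiter=b"\n"):
    """Variable-length records: a delimiter marks the end of each record."""
    records = data.split(delimiter)
    if records and records[-1] == b"":
        records.pop()   # drop the empty piece after the final delimiter
    return records

if __name__ == "__main__":
    cards = b"A" * 80 + b"B" * 80
    print(len(split_fixed(cards)))
    print(split_delimited(b"short\nlonger line\n"))
```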
How an Accident of Hardware Design Encouraged Open Source (O'ReillyNet)
Posted Feb 26, 2007 18:52 UTC (Mon) by MBR (guest, #43632)
I'm afraid you're misremembering how strings were stored. When I was in DEC's Small Systems Group, I wrote large parts of RT-11 MU-BASIC for the PDP-11, and before that in college, I'd spent years writing assembly code for the IBM 360. So I was pretty sure I was remembering correctly when I wrote the article. But just to be sure, I dug out my old PDP 11/45 Processor Handbook and my old BAL-360 manual and double-checked before I submitted the article.
How an Accident of Hardware Design Encouraged Open Source (O'ReillyNet)
Posted Feb 26, 2007 23:24 UTC (Mon) by jzbiciak (✭ supporter ✭, #5246)
BTW, someone had commented on the article that big vs. little endian only makes a difference when the machine's word-size is larger than a byte. That's hardly true. If you try to manipulate any quantities larger than a byte, you need to have a well known way of breaking it down. The 6502 and 6800 were little endian as I recall, because 16-bit addresses were stored little-end first. Little endian has the additional advantage that numbers are stored in the order in which you operate on them. (Think about it... addition, at least as it was taught to you in grade school, is little endian.)
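The grade-school addition point can be sketched in a few lines. This is an illustrative model, not how any particular CPU implements it; `add_little_endian` is a made-up name. With the least significant byte at index 0, a single forward pass through memory is also the order in which carries propagate.

```python
def add_little_endian(a, b):
    """Add two equal-length little-endian byte strings, carrying upward.

    Returns (sum_bytes, carry_out).
    """
    if len(a) != len(b):
        raise ValueError("operands must be the same length")
    result = bytearray()
    carry = 0
    for x, y in zip(a, b):          # index 0 is the least significant byte
        total = x + y + carry
        result.append(total & 0xFF)
        carry = total >> 8
    return bytes(result), carry

if __name__ == "__main__":
    a = (0x01FF).to_bytes(2, "little")
    b = (0x0001).to_bytes(2, "little")
    total, carry = add_little_endian(a, b)
    print(hex(int.from_bytes(total, "little")), carry)
```

On a big-endian layout the same loop would have to start at the highest address and walk backwards, which is the asymmetry the comment is pointing at.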
Now, didn't the later PDP / VAX machines do something funky with how 32-bit values got broken down into 8-bits? I seem to recall a "middle endian" with a 2,3,0,1 (or 1,0,3,2 depending on how you look at it) byte order because it was big endian between 16-bit halves of the 32-bit word and little-endian among the bytes in each half-word, or vice versa. That'd be where NUXI vs XINU comes in...
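The NUXI story is usually told as the string "UNIX", written out as two 16-bit words on one machine, coming back with the bytes of each word swapped on another. A tiny sketch of that swap, with a hypothetical helper name:

```python
def swap_byte_pairs(data):
    """Swap the two bytes inside each 16-bit word of an even-length buffer."""
    if len(data) % 2:
        raise ValueError("need a whole number of 16-bit words")
    out = bytearray(data)
    # Even positions take the old odd bytes, and vice versa.
    out[0::2], out[1::2] = data[1::2], data[0::2]
    return bytes(out)

if __name__ == "__main__":
    print(swap_byte_pairs(b"UNIX"))
```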
How an Accident of Hardware Design Encouraged Open Source (O'ReillyNet)
Posted Feb 28, 2007 19:27 UTC (Wed) by MBR (guest, #43632)
See my comment below in response to "PDP-endian" byte order. The PDP-11 addressing for multi-word data really confused the issue. Note that the VAX was its own series and was not part of the PDP series. I think DEC may have fixed the PDP-11's weird little-endian byte order but big-endian word order when they designed the VAX architecture, but I'm not sure because by the time the VAX came along I was doing more coding in high level languages and not so much in assembly language.
How an Accident of Hardware Design Encouraged Open Source (O'ReillyNet)
Posted Mar 1, 2007 9:03 UTC (Thu) by charris (subscriber, #13263)
I do recall that weird byte order for floating point numbers, but I don't remember it for integers.
How an Accident of Hardware Design Encouraged Open Source (O'ReillyNet)
Posted Mar 1, 2007 16:47 UTC (Thu) by brouhaha (subscriber, #1698)
The 6800, 6809, and other 8-bit Motorola microprocessors were big-endian. As you say, the 6502 is little-endian.
The 68000 was interesting in that it was mostly big-endian, but the bit ordering used by the bit instructions was little-endian. When they added bitfield instructions to the 68020, with bitfields that could span words, they had to adopt big-endian bit numbering for those instructions to make it work well.
IBM, on the other hand, uses consistently big-endian numbering for both bits and bytes. I've always thought it aggravating to deal with bytes and words that have bit zero as the most significant bit, but at least it is consistent.
How an Accident of Hardware Design Encouraged Open Source (O'ReillyNet)
Posted Mar 1, 2007 17:26 UTC (Thu) by jzbiciak (✭ supporter ✭, #5246)
Ok, I misremembered then. I didn't have much opportunity to program the 6800 series back in the day--just a few brushes with a 6875 and a 68HC11.
Big endian bit numbering is supremely annoying to me. The TI Home Computer numbered its busses in big-endian. A0 was the MSB of the address bus. Depending on whether you were looking at the 16-bit or 8-bit portion of the bus, the LSB was A14 or A15. (Yes, I did program 9900 assembly. It was my first assembly language.)
It works, but it only works well if you left-justify addresses as opposed to right-justifying them. And that only works if you know what your longest address is. What happens if the machine goes to 32 bits? :-)
(I'll stop now. I had written more, but thought better of it.)
Endian consistency
Posted Mar 3, 2007 5:41 UTC (Sat) by ldo (subscriber, #40946)
> IBM, on the other hand, uses consisten[t]ly big-endian numbering for both bits and bytes.
Actually, there is no completely consistent big-endian system. There are three different decisions to be made as far as endianness goes:
1. The ordering of bits in a byte.
2. The ordering of bytes in a multi-byte object.
3. The place value of binary digits making up an integer.
The IBM convention gets consistency among the first two, but not the third. The only way to get consistency among all three is to go little-endian.
Another point is that, even on big-endian architectures, registers tend to effectively be treated as little-endian--when you load/store quantities less than the full register size, you get the least-significant end of the register, not the most-significant end. Again, the only way to be completely consistent is to go little-endian.
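One way to read the consistency argument above is to ask what you need to know in order to turn a bit number or a byte offset into a place value. The sketch below uses made-up function names and is only an illustration of that reading: with little-endian numbering the index alone is enough, while with big-endian numbering you also need the width of the object.

```python
def bit_weight_lsb0(index):
    """Little-endian numbering: bit i always has place value 2**i."""
    return 1 << index

def bit_weight_msb0(index, width):
    """Big-endian numbering (bit 0 is the MSB): the weight depends on width."""
    if not 0 <= index < width:
        raise ValueError("bit index out of range")
    return 1 << (width - 1 - index)

def byte_weight_little(offset):
    """Little-endian bytes: the byte at offset k is worth 256**k."""
    return 256 ** offset

def byte_weight_big(offset, size):
    """Big-endian bytes: the byte at offset k is worth 256**(size - 1 - k)."""
    if not 0 <= offset < size:
        raise ValueError("byte offset out of range")
    return 256 ** (size - 1 - offset)

if __name__ == "__main__":
    # Bit 0 means the same thing at any width only in the LSB-0 scheme.
    print(bit_weight_lsb0(0), bit_weight_msb0(0, 8), bit_weight_msb0(0, 32))
```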
How Open Formats Encouraged Open Source (O'ReillyNet)
Posted Feb 25, 2007 7:06 UTC (Sun) by jd (guest, #26381)
There are three byte-ordering schemes listed in glibc - big-endian,
little-endian and PDP-endian. Last I looked, there were no Open Source
programs that supported PDP-endian or could even translate from it. It
makes me a little skeptical that DEC could have had this huge impact and
yet not be supported even on fairly old legacy Open Source code.
How an Accident of Hardware Design Encouraged Open Source (O'ReillyNet)
Posted Feb 26, 2007 19:11 UTC (Mon) by MBR (guest, #43632)
The PDP-endian byte-ordering scheme is almost certainly a result of a design inconsistency in the PDP-11. The PDP-11 hardware designers decided to give low-order bytes lower addresses than high-order bytes, but were only thinking of a single 16-bit word when they made this decision. Somehow, they forgot about this when they had to deal with multi-word data like floats and doubles. Maybe a different hardware engineer did that part of the design and nobody noticed the inconsistency until too late. In any case, high-order 16-bit words within floats and doubles were stored before low-order words. So within a 16-bit word the PDP-11 was little-endian, but within multi-word units of data, the word-order was big-endian.
DEC dropped the prefix PDP (Programmable Data Processor) in favor of VAX (Virtual Address eXtension) when they designed the VAX series of computers. Assuming that the VAX engineers fixed the PDP-11's multi-word addressing glitch, I'm not at all surprised that you could find "no Open Source programs that supported PDP-endian or could even translate from it". That doesn't alter the fact that DEC's small cheap machines were ubiquitous on college campuses, and influenced the thinking of a large number of software engineers. From sometime in the 1970s through at least the late 1980s, DEC was second only to IBM as the largest computer manufacturer in the world.
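The mixed layout described above (little-endian bytes within each 16-bit word, but the high-order word stored first) can be sketched for a raw 32-bit pattern. The function names are hypothetical, and this deliberately ignores the PDP-11 floating-point format itself; it only shows where the four bytes end up.

```python
import struct

def pack_pdp_endian32(value):
    """Lay out a 32-bit value high word first, each 16-bit word little-endian."""
    high = (value >> 16) & 0xFFFF
    low = value & 0xFFFF
    return struct.pack("<HH", high, low)

def unpack_pdp_endian32(raw):
    """Inverse of pack_pdp_endian32."""
    high, low = struct.unpack("<HH", raw)
    return (high << 16) | low

if __name__ == "__main__":
    value = 0x0A0B0C0D
    print(pack_pdp_endian32(value).hex())      # mixed order
    print(struct.pack("<I", value).hex())      # pure little-endian
    print(struct.pack(">I", value).hex())      # pure big-endian
```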
How an Accident of Hardware Design Encouraged Open Source (O'ReillyNet)
Posted Mar 1, 2007 7:18 UTC (Thu) by jzbiciak (✭ supporter ✭, #5246)
On VAX vs. PDP, you're mostly right. I found this amusing article which indicates that for integers, VAX was pure little endian, and for floating point it retained PDP endian. How's that for wacky?
Big endian vs little endian
Posted Mar 2, 2007 19:34 UTC (Fri) by giraffedata (subscriber, #1954)
The article makes some crucial mistakes in describing why there are two systems.
It says the DEC engineers, in contrast to the IBM engineers, wanted to number the bytes the same as the bits. But IBM engineers also numbered the bytes the same as the bits. In all 360-related discussion, IBM numbers the most significant bit 0, just as the most significant byte.
It also talks about numbering bytes left to right, but there is no left and right in computer memory.
The crux of this conflict is that we write numbers so that you read the most significant digit earliest. Earlier times correlate to lower numbers, but higher digit significances are naturally higher numbers. IBM engineers decided to be compatible with the writing convention, while DEC engineers decided to be compatible with arithmetic. The DEC way makes for cleaner programs, but the IBM way is much easier for humans to visualize and talk about.
Besides, the byte ordering decision wasn't a decision about byte numbering; it's the other way around: the bytes were already numbered (addresses) and engineers had to decide in which byte to put the most significant bits -- the high-numbered one or the low-numbered one.
Big endian vs little endian
Posted Mar 6, 2007 9:03 UTC (Tue) by xoddam (subscriber, #2322)
> (big-endian) is much easier for humans to visualize and talk about.
For "humans", substitute "people who use big-endian representations in their everyday written language" and you'll be correct. Numbers are little-endian in written Arabic and some other languages.
There is a totally inconsistent exception for *telephone* numbers. A telephone number is a sequence of digits, written from left to right *even in Arabic*.
Big endian vs little endian
Posted Mar 6, 2007 16:34 UTC (Tue) by giraffedata (subscriber, #1954)
> Numbers are little-endian in written Arabic and some other languages.
Thanks. I've wondered about that. In those languages, is there a more explicit form of writing numbers, such as the English "three thousand two hundred five"? If so, is it little endian or big endian?
> There is a totally inconsistent exception for *telephone* numbers. A telephone number is a sequence of digits, written from left to right *even in Arabic*.
I don't see any inconsistency. A telephone number isn't a number; it's just a digit sequence. There's only one sane way to write a telephone number: in the order in which you dial it. And there was only one practical way to build the early telephone switches: most significant digit first.
I know of one such inconsistency, though: Some Chinese is written from right to left with numbers as Western numerals left to right. The reader skips ahead and reads the numeral most significant digit first (as it is spoken).
Big endian vs little endian
Posted Mar 8, 2007 20:20 UTC (Thu) by netizen (guest, #43966)
I think it important to remember that when the IBM System/360 was announced on 01 APR 1964 as an all around scientific -and- commercial data processing device (yes, "data processing"; "information technology" had not even been coined, then) that the primary purpose for even commercial processing was for numerical processing! Text processing was just something else it did, sometimes with a move command, but usually with a sub-routine which utilized specialized text processing instructions. Also, while its native mode was Enhanced Binary Coded Decimal for Interchange Code -- EBCDIC -- all instructions had an ASCII-bit for switching to ASCII interpretation. ASCII was new at the time. From 1964 until today I have never heard of anyone actually using the ASCII-bit. EBCDIC was an extension/enhancement to the code structure which had been used in IBM's previous mainframes -- both commercial and scientific oriented number processors.
System/360 operated with 32-bit and 64-bit instructions. And, yes, some few instructions worked on only 16-bit or 8-bit at a time. A decimal number, e.g. 750124, was converted by the hardware into a series of hexadecimal units. Hexadecimal had the advantage that the hex-bits laid out the same way, left-to-right, in HEX, BCD, EBCDIC, the way they were written on a bank check (for instance) and the same way they would have been punched into an IBM-card.
Inside the CPU a single instruction could fetch a customer's entire record. A second instruction could load the customer's account balance into a register of choice. The third instruction could subtract the customer's bank check payment from that register, after which a fourth instruction stored the resulting new balance back into the customer's record. Finally a fifth instruction would write the updated record back onto a designated storage device. There was no bit manipulation. Just Bif!, Bam! And that was all using assembler language instructions!
Such was the power of a "big iron" instruction set with big-endian data representation. Multiple, i.e. sixteen, 32-bit registers cost big time and instruction set execution which could process such registers (or pairs of such registers for 64-bit operations) in one cycle was expensive.
The Intel 4000-series (4004, etc.) and 8000-series (8008, 8080, etc.) understandably took a different approach. IBM's main customers were primarily insurance companies which were drowning in a sea of paper record processing and desperately needed to automate or die. (Word was that many sent clerks to IBM programming school; those which passed still had jobs.)
Years later, Intel's 4000/8000 customers were, originally, vending machine manufacturers who wanted to escape the break-down prone mechanical change-making processors which were then being used in machines which sold more than one-type, one-price product. Little-endian programming seemed appropriate for such (no insult intended) nickel-and-dime processing using a minimal instruction set. {Intel counted adding to A-reg and adding to B-reg as two different instructions; IBM counted an instruction which could add to any register, 0 through 15, as one instruction.}
IMHO, I prefer big-iron type powerful instruction sets which operate in one fell swoop on entire chunks of data using open-source code and open-format data representation. It takes billions of us to operate this planet and anyone who feels the need or urge ought to be able to have a fair go at it and, if successful leave behind a trail others can extend. {Well, that's my 64 bits worth! :) Thanks for reading if you stayed this far.}
Big endian vs little endian
Posted Mar 8, 2007 22:31 UTC (Thu) by giraffedata (subscriber, #1954)
> while its native mode was Enhanced Binary Coded Decimal for Interchange Code -- EBCDIC
I've heard this said before, but I've done a great deal of programming in later implementations of that architecture and I can't recall the CPU ever being cognizant of what character code I was using except in those instructions that convert between EBCDIC and ASCII. And ISTR there were some with which you could take advantage of the fact that the lower 4 bits of the code for a digit was also the binary representation of that digit. (That's true for EBCDIC and ASCII). How is EBCDIC native, and what did the ASCII bit do?
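The claim about the lower 4 bits of a digit's code is easy to check. A quick sketch, again assuming Python's `cp037` codec as a representative EBCDIC code page; `low_nibbles` is a name invented for this example:

```python
DIGITS = "0123456789"

def low_nibbles(text, encoding):
    """Return the low 4 bits of each character's code in the given encoding."""
    return [byte & 0x0F for byte in text.encode(encoding)]

if __name__ == "__main__":
    print(low_nibbles(DIGITS, "ascii"))
    print(low_nibbles(DIGITS, "cp037"))
```

In both character sets masking off the high nibble turns a digit character into its numeric value, so that particular trick does not depend on which of the two codes is in use.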
> A decimal number, e.g 750124, was converted by the hardware into a series of hexadecimal units.
What is a hexadecimal unit? It sounds a lot like you're talking about what IBM calls packed decimal and everyone else calls BCD: a number code in which each decimal digit is represented in binary in 4 bits and those nybbles are lined up big-endian. But that has nothing to do with hexadecimal.
Also, that was one of two number codings used on S/360. The other was the big-endian pure binary code we've been talking about. Pure binary is easiest to do arithmetic on, but packed decimal is easiest to do input and output on. Many early applications were much more input and output than arithmetic.
Incidentally, Intel 8080 also had instructions for packed decimal/BCD.
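To show why packed decimal can look like "hexadecimal units" in a dump while having nothing to do with hexadecimal as a number base, here is a simplified sketch. It packs two decimal digits per byte, most significant digit first, and deliberately omits the sign nibble that IBM's packed decimal format carries in the last position; the function names are made up.

```python
def pack_bcd(number):
    """Encode a non-negative integer as packed BCD, two digits per byte."""
    if number < 0:
        raise ValueError("this sketch handles non-negative integers only")
    digits = str(number)
    if len(digits) % 2:
        digits = "0" + digits       # pad to a whole number of bytes
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )

def unpack_bcd(raw):
    """Decode packed BCD back into an integer."""
    value = 0
    for byte in raw:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError("not a valid BCD digit")
        value = value * 100 + high * 10 + low
    return value

if __name__ == "__main__":
    n = 750124
    print(pack_bcd(n).hex())                 # hex dump reads like the decimal digits
    print(n.to_bytes(3, "big").hex())        # pure binary encoding of the same number
```

The hex dump of the packed form reads the same as the decimal number, which is convenient for input and output; the pure binary form is what the arithmetic hardware prefers.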
> the hex-bits laid out the same way, left-to-right, in HEX, BCD, EBCDIC,
But there is no left or right in computer memory, so this doesn't figure into the choice of big-endian or little-endian. Big-endian is not left to right. Big endian is the most significant byte in the location with the lowest address.
> Little-endian programming seemed appropriate for such (no insult intended) nickel-and-dime processing using a minimal instruction set.
What is the connection between little-endian and minimal instruction set?
> {Intel counted adding to A-reg and adding to B-reg as two different instructions; IBM counted an instruction which could add to any register, 0 through 15, as one instruction.}
I think you're really pointing out that IBM had expensive general purpose registers, while Intel had special purpose registers. In fact, the only Intel CPU of that era that I programmed (8080) had only one register you could add to: A (the accumulator). But I must miss your point anyway; why do we care how people classify the instructions?
Big endian vs little endian
Posted Mar 23, 2007 22:09 UTC (Fri) by BugLess (guest, #43869)
"I don't see any inconsistency. A telephone number isn't a number; it's just a digit sequence. There's only one sane way to write a telephone number: in the order in which you dial it."
How does that make -any- sense? If you're reading right-to-left, you'd read the numbers right-to-left and dial right-to-left.
Big endian vs little endian
Posted Mar 24, 2007 0:22 UTC (Sat) by giraffedata (subscriber, #1954)
"I don't see any inconsistency. A telephone number isn't a number; it's just a digit sequence. There's only one sane way to write a telephone number: in the order in which you dial it."
> How does that make -any- sense? If you're reading right-to-left, you'd read the numbers right-to-left and dial right-to-left.
I failed to notice that the quote to which I was responding is contradictory. My statement makes sense if you believe the first half of it ("telephone numbers are an exception to writing numbers little-endian"), but is nonsense if you believe the second half ("telephone numbers are left to right").
I confirmed at http://www2.ignatius.edu/faculty/turner/arabic/anumbers.htm that numbers in Arabic are written little-endian (least significant digit on the right).
Now the only question is: in what direction are telephone numbers written? Common sense tells me the "left to right" from the original is a typo and is supposed to say "right to left." That way, it is big-endian, which is inconsistent with the way numbers are written, but is in the order of dialing. Which would make my objection correct: there's no real inconsistency because telephone numbers aren't numbers.
(While telephone numbers aren't numbers, I consider the digits to have significance in the same way numbers do; the digits that select among the largest geographical area in the original geographical numbering system are the more significant).