Thursday, May 05, 2005

Hey baby, wanna swap bytes?


Well, today's task is to write a short conversion routine that will convert MATLAB (which I do not have) NMR data into nmrPipe format. It should be a fairly simple procedure, as I have already dealt with most of the issues in other routines, and the file format for MATLAB files is available. It does involve byte swapping, however, which is a bit of a fiddly thing. Beware: if this kind of thing does not interest you, you may want to save the next section for a time when you are having trouble sleeping.


In the computer world, everything you see is represented in the hardware by a small potential difference which is either on (1) or off (0). Since a single one of these bits can only represent two values, you group them together in series to represent larger numbers. Usually 8 bits are grouped together to form one byte. For example, let's say you wanted to store the value 5 in memory. In this case you would turn on the first and third bits of your byte (counting from the right), giving you 00000101 (i.e. 1 times 1, plus 0 times 2, plus 1 times 4, plus 0 times 8, etc.). A single byte can therefore only store numbers from 0 to 255, and is usually reserved for characters, of which there are fewer than 256. So if you wanted to store the phrase "Hi!", you would set an array of bytes in memory to 72, 105, 33, and that would store the phrase, or string. These bytes are usually not written out in binary, but in hexadecimal (base 16, represented by the numbers 0 through 9 and the letters a through f), so "Hi!" in hexadecimal would be 48 69 21.


You can also represent numbers that are larger than 255 by grouping bytes together: two bytes can represent 0 to 65,535, or -32,768 to 32,767, and four bytes can represent 0 to 4,294,967,295, or -2,147,483,648 to 2,147,483,647. The problem that you come across when you store multi-byte values is what order to store the bytes in. In other words, do you represent the integer 14 in four bytes as 00 00 00 0e or as 0e 00 00 00? Some computers store their data using the first ordering (called big endian, used by pretty much everyone but x86-type chips) and others use the second ordering (called little endian, used by all of the Intel crowd). If you stuck to one platform you would never notice any of this, as each system reads and writes bytes in the same order. But if you were to switch systems, the bytes would appear to be backwards on the opposite system.
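For anyone who wants to poke at this themselves, here is a small Python sketch of the examples above. It is only an illustration, not part of my conversion routine; the variable names are made up, and it assumes Python 3.8 or later (for the separator argument to `hex`).

```python
import struct

# The value 5 as one byte: the bits worth 1 and 4 are switched on.
five = format(5, "08b")

# "Hi!" as byte values, in decimal and in hexadecimal.
text = "Hi!".encode("ascii")
decimal_values = list(text)
hex_values = text.hex(" ")

# The integer 14 stored in four bytes, in each byte order.
big = struct.pack(">i", 14)      # ">": most significant byte first
little = struct.pack("<i", 14)   # "<": least significant byte first

# Reading big-endian bytes as if they were little-endian gives the wrong value.
misread = struct.unpack("<i", big)[0]

print(five)                               # 00000101
print(decimal_values, hex_values)         # [72, 105, 33] 48 69 21
print(big.hex(" "), "|", little.hex(" ")) # 00 00 00 0e | 0e 00 00 00
print(misread)                            # 234881024
```

The last line is the whole problem in miniature: the same four bytes that mean 14 on one kind of machine come out as 234881024 when they are read in the opposite order.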
Byte order is one of the things an emulator has to deal with when you run Windows on an Apple computer, or Mac OS on a PC. The two processors use different instruction sets to begin with, and on top of that the emulator must swap the bytes of multi-byte values going through the system so that the host processor can deal with them. I run into the same problem when someone sends me a binary file that has been saved on one platform, and I need to read it into my program on my platform. In order to do this I need to swap the bytes of each multi-byte number that I read in from the file so that it is represented correctly on my computer. The way that you do this is a bit underhanded and tricky. First you read in the number of bytes that make up the number (you need to know this beforehand from the file format). Then you trick the computer into thinking that your multi-byte number is actually an array of single-byte characters. Then you swap the first and last array elements, then the second and second-to-last, and so on in toward the middle. When you are done, your multi-byte number has had its bytes swapped and now represents the correct value for the platform you are on. It's a somewhat time-consuming procedure, so it's best not to have to do it at all, but it is unavoidable when you deal with files from multiple computers.
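The swapping procedure looks roughly like this in Python. This is a sketch of the idea rather than my actual routine: `swap_bytes` and `from_file` are names I made up for the example, and the four bytes stand in for a number read from a big-endian file onto a little-endian machine.

```python
import struct

def swap_bytes(raw):
    """Return a copy of raw with its byte order reversed."""
    buf = bytearray(raw)          # treat the number as an array of single bytes
    i, j = 0, len(buf) - 1
    while i < j:
        # Swap the outermost pair, then work in toward the middle.
        buf[i], buf[j] = buf[j], buf[i]
        i += 1
        j -= 1
    return bytes(buf)

# Four bytes as they might arrive from a big-endian file (the value 14).
from_file = bytes([0x00, 0x00, 0x00, 0x0E])

swapped = swap_bytes(from_file)
value = struct.unpack("<i", swapped)[0]   # "<i": little-endian 4-byte signed int

print(swapped.hex(" "), value)            # 0e 00 00 00 14
```

In Python you would not normally bother with the manual swap, since `struct.unpack(">i", from_file)` reads the big-endian bytes directly. The loop is there to show what the first-and-last, then inward, swapping actually does.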


At any rate, this is one of the things that I have to make sure is done correctly when I write my conversion routine. When I am done, my program will be just a little more flexible than it was before. This is my fun for the day.
