|
https://github.com/Rabrg/refactored-...atEncoder.java
Can anyone break down the code and explain it? I'm trying to understand it but I'm having trouble. Just explain it as you would explain it to yourself, no need to dumb it down.
What's the theory behind all of this?
Thanks for reading.
The basic idea is to take advantage of the fact that some characters appear more frequently than others in English text. The client uses this knowledge to encode the 13 most commonly used characters using only 4 bits instead of 8 bits. This lets you save a good deal of space in the general case and also when the message consists mostly of the common characters. Messages happen to be limited to 80 characters, though, so the savings aren't all that great (e.g. 40 bytes vs. 80 bytes for a message consisting entirely of the most frequent characters) in this case.
Most of the code is dedicated to dealing with packing the encoded characters into bytes while dealing with overflowed nibbles and isn't too important.
In general, making use of things like character frequencies and other text metadata can help save a lot of space. Here's some related pages on Wikipedia that might be interesting:
Letter frequency - Wikipedia, the free encyclopedia
Information theory - Wikipedia, the free encyclopedia
In addition, if you're interested, the newer clients make use of Huffman coding for text compression. It's pretty standard in more general compression algorithms too.
Thanks for the information. I'm sure other people will find it interesting as well.
well supah fly, as graham said, you can think of it as a basic variable-width encoding which gives priority (i.e. most frequent) characters singleton 4-bit codes while using 2 such codes to encode *less* frequent characters... it's a basic means of special-purpose text compression. in your code specifically, `characterBit` is actually the next nibble if input, while `validCharacterIndex` is sorta the lead unit, the first code used to encode one of the less frequent characters. if we have read a lead unit, it knows to read a trail unit and combine them (hence `(validCharacterIndex << 4) + characterBit`)
if you look at the order of the table, you should recognize immediately it's based on letter frequency ****ysis. pretty similar to etaoin shrdlu - Wikipedia, the free encyclopedia eh
« Previous Thread | Next Thread » |
Thread Information |
Users Browsing this ThreadThere are currently 1 users browsing this thread. (0 members and 1 guests) |