Huffman coding

Huffman tree generated from the exact frequencies of the text "this is an example of a huffman tree". Encoding the sentence with this code requires 135 (or 147) bits, as opposed to 288 (or 180) bits if 36 characters of 8 (or 5) bits were used (This assumes that the code tree structure is known to the decoder and thus does not need to be counted as part of the transmitted information). The frequencies and codes of each character are shown in the accompanying table.
Char Freq Code
space 7 111
a 4 010
e 4 000
f 3 1101
h 2 1010
i 2 1000
m 2 0111
n 2 0010
s 2 1011
t 2 0110
l 1 11001
o 1 00110
p 1 10011
r 1 11000
u 1 00111
x 1 10010

In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".[1]

The output from Huffman's algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file). The algorithm derives this table from the estimated probability or frequency of occurrence (weight) for each possible value of the source symbol. As in other entropy encoding methods, more common symbols are generally represented using fewer bits than less common symbols. Huffman's method can be efficiently implemented, finding a code in time linear to the number of input weights if these weights are sorted.[2] However, although optimal among methods encoding symbols separately, Huffman coding is not always optimal among all compression methods – it is replaced with arithmetic coding[3] or asymmetric numeral systems[4] if a better compression ratio is required.

History


In 1951, David A. Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree and quickly proved this method the most efficient.[5]

In doing so, Huffman outdid Fano, who had worked with Claude Shannon to develop a similar code. Building the tree from the bottom up guaranteed optimality, unlike the top-down approach of Shannon–Fano coding.

Terminology


Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code (sometimes called a "prefix-free code"; that is, the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol). Huffman coding is such a widespread method for creating prefix codes that the term "Huffman code" is widely used as a synonym for "prefix code" even when such a code is not produced by Huffman's algorithm.

Problem definition

Constructing a Huffman tree

Informal description

Given
A set of symbols S and, for each symbol x in S, the frequency freq(x) representing the fraction of symbols in the text that are equal to x.[6]
Find
A prefix-free binary code (a set of codewords) with minimum expected codeword length (equivalently, a tree with minimum weighted path length from the root).

Formalized description


Input.
Alphabet A = (a1, a2, ..., an), which is the symbol alphabet of size n.
Tuple W = (w1, w2, ..., wn), which is the tuple of the (positive) symbol weights (usually proportional to probabilities), i.e. wi = weight(ai) for 1 ≤ i ≤ n.

Output.
Code C(W) = (c1, c2, ..., cn), which is the tuple of (binary) codewords, where ci is the codeword for ai, 1 ≤ i ≤ n.

Goal.
Let L(C(W)) = Σi wi · length(ci) be the weighted path length of code C. Condition: L(C(W)) ≤ L(T(W)) for any code T(W).

Example


We give an example of the result of Huffman coding for a code with five characters and given weights. We will not verify that it minimizes L over all codes, but we will compute L and compare it to the Shannon entropy H of the given set of weights; the result is nearly optimal.

Input (A, W)   Symbol (ai)                                    a      b      c      d      e      Sum
               Weights (wi)                                   0.10   0.15   0.30   0.16   0.29   = 1
Output C       Codewords (ci)                                 010    011    11     00     10
               Codeword length in bits (li)                   3      3      2      2      2
               Contribution to weighted path length (li wi)   0.30   0.45   0.60   0.32   0.58   L(C) = 2.25
Optimality     Probability budget (2^-li)                     1/8    1/8    1/4    1/4    1/4    = 1.00
               Information content in bits (-log2 wi) ≈       3.32   2.74   1.74   2.64   1.79
               Contribution to entropy (-wi log2 wi)          0.332  0.411  0.521  0.423  0.518  H(A) = 2.205

For any code that is biunique, meaning that the code is uniquely decodeable, the sum of the probability budgets across all symbols is always less than or equal to one. In this example, the sum is strictly equal to one; as a result, the code is termed a complete code. If this is not the case, one can always derive an equivalent code by adding extra symbols (with associated null probabilities), to make the code complete while keeping it biunique.

As defined by Shannon (1948), the information content h (in bits) of each symbol ai with non-null probability wi is

h(ai) = log2(1/wi) = -log2(wi).

The entropy H (in bits) is the weighted sum, across all symbols ai with non-zero probability wi, of the information content of each symbol:

H(A) = Σi wi h(ai) = Σi wi log2(1/wi) = -Σi wi log2(wi).

(Note: A symbol with zero probability has zero contribution to the entropy, since lim(w→0+) w log2(w) = 0. So for simplicity, symbols with zero probability can be left out of the formula above.)

As a consequence of Shannon's source coding theorem, the entropy is a measure of the smallest codeword length that is theoretically possible for the given alphabet with associated weights. In this example, the weighted average codeword length is 2.25 bits per symbol, only slightly larger than the calculated entropy of 2.205 bits per symbol. So not only is this code optimal in the sense that no other feasible code performs better, but it is very close to the theoretical limit established by Shannon.

In general, a Huffman code need not be unique. Thus the set of Huffman codes for a given probability distribution is a non-empty subset of the codes minimizing L(C) for that probability distribution. (However, for each minimizing codeword length assignment, there exists at least one Huffman code with those lengths.)

Basic technique


Compression

Visualisation of the use of Huffman coding to encode the message "A_DEAD_DAD_CEDED_A_BAD_BABE_A_BEADED_ABACA_BED". In steps 2 to 6, the letters are sorted by increasing frequency; at each step the two least frequent are combined and reinserted into the list, and a partial tree is constructed. The final tree in step 6 is traversed to generate the dictionary in step 7. Step 8 uses it to encode the message.
A source generates 4 different symbols {a1, a2, a3, a4} with probabilities {0.4, 0.35, 0.2, 0.05}. A binary tree is generated from left to right by taking the two least probable symbols and combining them into a new equivalent symbol whose probability equals the sum of the two. The process is repeated until there is just one symbol. The tree can then be read backwards, from right to left, assigning different bits to different branches. The final Huffman code is:
Symbol Code
a1 0
a2 10
a3 110
a4 111
The standard way to represent a signal made of 4 symbols is by using 2 bits/symbol, but the entropy of the source is 1.74 bits/symbol. If this Huffman code is used to represent the signal, then the average length is lowered to 1.85 bits/symbol; it is still far from the theoretical limit because the probabilities of the symbols are different from negative powers of two.
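
To check these figures, using the probabilities {0.4, 0.35, 0.2, 0.05} from the figure above: the average codeword length is 0.4·1 + 0.35·2 + 0.2·3 + 0.05·3 = 1.85 bits/symbol, and the entropy is -(0.4 log2 0.4 + 0.35 log2 0.35 + 0.2 log2 0.2 + 0.05 log2 0.05) ≈ 1.74 bits/symbol.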

The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols, n. A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and, optionally, a link to a parent node, which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain a weight, links to two child nodes and an optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has up to n leaf nodes and n − 1 internal nodes. A Huffman tree that omits unused symbols produces the optimal code lengths.

The process begins with the leaf nodes containing the probabilities of the symbols they represent. Then, the process takes the two nodes with the smallest probability and creates a new internal node having these two nodes as children. The weight of the new node is set to the sum of the weights of the children. We then apply the process again, on the new internal node and on the remaining nodes (i.e., excluding the two just-combined nodes), repeating until only one node remains: the root of the Huffman tree.

The simplest construction algorithm uses a priority queue where the node with lowest probability is given highest priority (see the sketch after these steps):

  1. Create a leaf node for each symbol and add it to the priority queue.
  2. While there is more than one node in the queue:
    1. Remove the two nodes of highest priority (lowest probability) from the queue
    2. Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
    3. Add the new node to the queue.
  3. The remaining node is the root node and the tree is complete.
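
A minimal sketch of this priority-queue construction is shown below; it uses Python's standard heapq module (a binary min-heap). The function name, the tie-breaking counter, and the choice to carry partial codewords inside the heap entries are illustrative rather than taken from any particular implementation.

```python
import heapq
from itertools import count

def huffman_code(weights):
    """weights: dict mapping symbol -> weight; returns dict mapping symbol -> bit string."""
    tiebreak = count()  # unique sequence numbers so equal weights never compare the payload lists
    # Each heap entry is (weight, tiebreak, [(symbol, partial_code), ...]).
    heap = [(w, next(tiebreak), [(sym, "")]) for sym, w in weights.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate single-symbol alphabet: give it a one-bit code
        return {heap[0][2][0][0]: "0"}
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # remove the two lowest-weight nodes
        w2, _, right = heapq.heappop(heap)
        # New internal node: prepend '0' along the left branch and '1' along the right branch.
        merged = [(s, "0" + c) for s, c in left] + [(s, "1" + c) for s, c in right]
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return dict(heap[0][2])

# The weights from the worked example above; the resulting codeword lengths
# (3, 3, 2, 2, 2) match the table, though the exact bit patterns may differ.
print(huffman_code({"a": 0.10, "b": 0.15, "c": 0.30, "d": 0.16, "e": 0.29}))
```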

Since efficient priority queue data structures require O(log n) time per insertion, and a tree with n leaves has 2n − 1 nodes, this algorithm operates in O(n log n) time, where n is the number of symbols.

If the symbols are sorted by probability, there is a linear-time (O(n)) method to create a Huffman tree using two queues, the first one containing the initial weights (along with pointers to the associated leaves), and combined weights (along with pointers to the trees) being put in the back of the second queue. This assures that the lowest weight is always kept at the front of one of the two queues (see the sketch after these steps):

  1. Start with as many leaves as there are symbols.
  2. Enqueue all leaf nodes into the first queue (by probability in increasing order so that the least likely item is in the head of the queue).
  3. While there is more than one node in the queues:
    1. Dequeue the two nodes with the lowest weight by examining the fronts of both queues.
    2. Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
    3. Enqueue the new node into the rear of the second queue.
  4. The remaining node is the root node; the tree has now been generated.
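
Below is a minimal sketch of this two-queue method, assuming the input is already sorted by increasing weight; the Node class and function names are illustrative.

```python
from collections import deque

class Node:
    """A tree node: leaves carry a symbol, internal nodes carry left/right children."""
    def __init__(self, weight, symbol=None, left=None, right=None):
        self.weight, self.symbol, self.left, self.right = weight, symbol, left, right

def huffman_tree_sorted(sorted_weights):
    """sorted_weights: list of (symbol, weight) pairs in increasing weight order."""
    leaves = deque(Node(w, symbol=s) for s, w in sorted_weights)  # first queue: original leaves
    internal = deque()                                            # second queue: combined subtrees

    def pop_lightest():
        # Take from whichever queue has the smaller weight at its front;
        # on ties, prefer the first (leaf) queue.
        if not internal or (leaves and leaves[0].weight <= internal[0].weight):
            return leaves.popleft()
        return internal.popleft()

    while len(leaves) + len(internal) > 1:
        a, b = pop_lightest(), pop_lightest()
        internal.append(Node(a.weight + b.weight, left=a, right=b))
    return (leaves or internal)[0]  # the single remaining node is the root
```

Each node is enqueued and dequeued at most once, and every queue operation takes constant time, which is where the linear bound comes from.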

Once the Huffman tree has been generated, it is traversed to generate a dictionary which maps the symbols to binary codes as follows:

  1. Start with current node set to the root.
  2. If node is not a leaf node, label the edge to the left child as 0 and the edge to the right child as 1. Repeat the process at both the left child and the right child.

The final encoding of any symbol is then read by a concatenation of the labels on the edges along the path from the root node to the symbol.
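
A minimal sketch of this traversal, reusing the Node class from the previous sketch (edges to the left child are labelled '0', edges to the right child '1'):

```python
def build_dictionary(root):
    """Map each symbol at a leaf to the concatenation of edge labels from the root."""
    codes = {}
    def walk(node, prefix):
        if node.symbol is not None:              # leaf: record the accumulated labels
            codes[node.symbol] = prefix or "0"   # guard for a degenerate single-symbol tree
            return
        walk(node.left, prefix + "0")
        walk(node.right, prefix + "1")
    walk(root, "")
    return codes

# Using the weights of the worked example, sorted by increasing weight:
root = huffman_tree_sorted([("a", 0.10), ("b", 0.15), ("d", 0.16), ("e", 0.29), ("c", 0.30)])
print(build_dictionary(root))   # codeword lengths 3, 3, 2, 2, 2 for a, b, d, e, c
```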

In many cases, time complexity is not very important in the choice of algorithm here, since n here is the number of symbols in the alphabet, which is typically a very small number (compared to the length of the message to be encoded); whereas complexity analysis concerns the behavior when n grows to be very large.

It is generally beneficial to minimize the variance of codeword length. For example, a communication buffer receiving Huffman-encoded data may need to be larger to deal with especially long symbols if the tree is especially unbalanced. To minimize variance, simply break ties between queues by choosing the item in the first queue. This modification will retain the mathematical optimality of the Huffman coding while both minimizing variance and minimizing the length of the longest character code.
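
In the two-queue sketch above, this tie-break corresponds to the `<=` comparison in pop_lightest, which prefers the front of the first (leaf) queue whenever the two fronts have equal weight.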

Decompression


Generally speaking, the process of decompression is simply a matter of translating the stream of prefix codes to individual byte values, usually by traversing the Huffman tree node by node as each bit is read from the input stream (reaching a leaf node necessarily terminates the search for that particular byte value). Before this can take place, however, the Huffman tree must somehow be reconstructed. In the simplest case, where character frequencies are fairly predictable, the tree can be preconstructed (and even statistically adjusted on each compression cycle) and thus reused every time, at the expense of at least some measure of compression efficiency. Otherwise, the information to reconstruct the tree must be sent a priori.

A naive approach might be to prepend the frequency count of each character to the compression stream. Unfortunately, the overhead in such a case could amount to several kilobytes, so this method has little practical use. If the data is compressed using canonical encoding, the compression model can be precisely reconstructed with just B·2^B bits of information (where B is the number of bits per symbol). Another method is to simply prepend the Huffman tree, bit by bit, to the output stream. For example, assuming that the value of 0 represents a parent node and 1 a leaf node, whenever the latter is encountered the tree-building routine simply reads the next 8 bits to determine the character value of that particular leaf. The process continues recursively until the last leaf node is reached; at that point, the Huffman tree will thus be faithfully reconstructed. The overhead using such a method ranges from roughly 2 to 320 bytes (assuming an 8-bit alphabet). Many other techniques are possible as well. In any case, since the compressed data can include unused "trailing bits", the decompressor must be able to determine when to stop producing output. This can be accomplished by either transmitting the length of the decompressed data along with the compression model or by defining a special code symbol to signify the end of input (the latter method can adversely affect code length optimality, however).
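
A minimal sketch of this bit-by-bit traversal, reusing the Node class from the sketches above; how the bits are framed is an assumption left to the surrounding container format:

```python
def decode(bits, root):
    """bits: an iterable of '0'/'1' characters; returns the list of decoded symbols."""
    out, node = [], root
    for bit in bits:
        node = node.left if bit == "0" else node.right
        if node.symbol is not None:   # reached a leaf: emit its symbol and restart at the root
            out.append(node.symbol)
            node = root
    return out
```

In practice the loop would also need one of the stopping mechanisms described above (an explicit decompressed length or an end-of-input symbol) to avoid decoding trailing padding bits.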

Main properties


The probabilities used can be generic ones for the application domain that are based on average experience, or they can be the actual frequencies found in the text being compressed. This requires that a frequency table must be stored with the compressed text. See the Decompression section above for more information about the various techniques employed for this purpose.

Optimality


Huffman's original algorithm is optimal for a symbol-by-symbol coding with a known input probability distribution, i.e., separately encoding unrelated symbols in such a data stream. However, it is not optimal when the symbol-by-symbol restriction is dropped, or when the probability mass functions are unknown. Also, if symbols are not independent and identically distributed, a single code may be insufficient for optimality. Other methods such as arithmetic coding often have better compression capability.

Although both aforementioned methods can combine an arbitrary number of symbols for more efficient coding and generally adapt to the actual input statistics, arithmetic coding does so without significantly increasing its computational or algorithmic complexities (though the simplest version is slower and more complex than Huffman coding). Such flexibility is especially useful when input probabilities are not precisely known or vary significantly within the stream. However, Huffman coding is usually faster and arithmetic coding was historically a subject of some concern over patent issues. Thus many technologies have historically avoided arithmetic coding in favor of Huffman and other prefix coding techniques. As of mid-2010, the most commonly used techniques for this alternative to Huffman coding have passed into the public domain as the early patents have expired.

For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding, e.g., ASCII coding. This reflects the fact that compression is not possible with such an input, no matter what the compression method, i.e., doing nothing to the data is the optimal thing to do.

Huffman coding is optimal among all methods in any case where each input symbol is a known independent and identically distributed random variable having a probability that is dyadic. Prefix codes, and thus Huffman coding in particular, tend to have inefficiency on small alphabets, where probabilities often fall between these optimal (dyadic) points. The worst case for Huffman coding can happen when the probability of the most likely symbol far exceeds 2^-1 = 0.5, making the upper limit of inefficiency unbounded.
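
As an illustrative example (not from the article's figures), consider a two-symbol source with probabilities 0.99 and 0.01. Any prefix code must spend at least 1 bit per symbol, so the Huffman code averages 1 bit/symbol, while the entropy is -(0.99 log2 0.99 + 0.01 log2 0.01) ≈ 0.081 bits/symbol; the code is therefore more than twelve times longer than the theoretical minimum, and the gap grows without bound as the dominant probability approaches 1.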

There are two related approaches for getting around this particular inefficiency while still using Huffman coding. Combining a fixed number of symbols together ("blocking") often increases (and never decreases) compression. As the size of the block approaches infinity, Huffman coding theoretically approaches the entropy limit, i.e., optimal compression.[7] However, blocking arbitrarily large groups of symbols is impractical, as the complexity of a Huffman code is linear in the number of possibilities to be encoded, a number that is exponential in the size of a block. This limits the amount of blocking that is done in practice.

A practical alternative, in widespread use, is run-length encoding. This technique adds one step in advance of entropy coding, specifically counting (runs) of repeated symbols, which are then encoded. For the simple case of Bernoulli processes, Golomb coding is optimal among prefix codes for coding run length, a fact proved via the techniques of Huffman coding.[8] A similar approach is taken by fax machines using modified Huffman coding. However, run-length coding is not as adaptable to as many input types as other compression technologies.

Variations


Many variations of Huffman coding exist,[9] some of which use a Huffman-like algorithm, and others of which find optimal prefix codes (while, for example, putting different restrictions on the output). Note that, in the latter case, the method need not be Huffman-like, and, indeed, need not even be polynomial time.

n-ary Huffman coding


The n-ary Huffman algorithm uses an alphabet of size n, typically {0, 1, ..., n−1}, to encode messages and build an n-ary tree. This approach was considered by Huffman in his original paper. The same algorithm applies as for binary (n = 2) codes, but instead of combining the two least likely symbols, the n least likely symbols are grouped together.

Note that for n > 2, not all sets of source words can properly form a complete n-ary tree for Huffman coding. In these cases, additional placeholder symbols with 0 probability may need to be added. This is because the structure of the tree needs to repeatedly join n branches into one – also known as an "n to 1" combination. For binary coding, this is a "2 to 1" combination, which works with any number of symbols. For n-ary coding, a complete tree is only possible when the total number of symbols (real + placeholders) leaves a remainder of 1 when divided by n − 1.[1]
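
A minimal sketch of this padding rule (the function name is illustrative): it returns how many zero-probability placeholders must be added so that the total number of symbols leaves a remainder of 1 when divided by n − 1.

```python
def placeholders_needed(num_symbols, n):
    """Number of dummy symbols to add so an n-ary Huffman tree comes out complete."""
    if num_symbols <= 1:
        return 0
    remainder = (num_symbols - 1) % (n - 1)
    return 0 if remainder == 0 else (n - 1) - remainder

print(placeholders_needed(6, 3))   # -> 1: six symbols under a ternary code need one placeholder
```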

Adaptive Huffman coding


A variation called adaptive Huffman coding involves calculating the probabilities dynamically based on recent actual frequencies in the sequence of source symbols, and changing the coding tree structure to match the updated probability estimates. It is used rarely in practice, since the cost of updating the tree makes it slower than optimized adaptive arithmetic coding, which is more flexible and has better compression.[citation needed]

Huffman template algorithm


Most often, the weights used in implementations of Huffman coding represent numeric probabilities, but the algorithm given above does not require this; it requires only that the weights form a totally ordered commutative monoid, meaning a way to order weights and to add them. The Huffman template algorithm enables one to use any kind of weights (costs, frequencies, pairs of weights, non-numerical weights) and one of many combining methods (not just addition). Such algorithms can solve other minimization problems, such as minimizing max_i [w_i + length(c_i)], a problem first applied to circuit design.

Length-limited Huffman coding/minimum variance Huffman coding


Length-limited Huffman coding is a variant where the goal is still to achieve a minimum weighted path length, but there is an additional restriction that the length of each codeword must be less than a given constant. The package-merge algorithm solves this problem with a simple greedy approach very similar to that used by Huffman's algorithm. Its time complexity is O(nL), where L is the maximum length of a codeword. No algorithm is known to solve this problem in O(n) or O(n log n) time, unlike the presorted and unsorted conventional Huffman problems, respectively.

Huffman coding with unequal letter costs


In the standard Huffman coding problem, it is assumed that each symbol in the set that the code words are constructed from has an equal cost to transmit: a code word whose length is N digits will always have a cost of N, no matter how many of those digits are 0s, how many are 1s, etc. When working under this assumption, minimizing the total cost of the message and minimizing the total number of digits are the same thing.

Huffman coding with unequal letter costs is the generalization without this assumption: the letters of the encoding alphabet may have non-uniform lengths, due to characteristics of the transmission medium. An example is the encoding alphabet of Morse code, where a 'dash' takes longer to send than a 'dot', and therefore the cost of a dash in transmission time is higher. The goal is still to minimize the weighted average codeword length, but it is no longer sufficient just to minimize the number of symbols used by the message. No algorithm is known to solve this in the same manner or with the same efficiency as conventional Huffman coding, though it has been solved by Richard M. Karp[10] whose solution has been refined for the case of integer costs by Mordecai J. Golin.[11]

Optimal alphabetic binary trees (Hu–Tucker coding)


In the standard Huffman coding problem, it is assumed that any codeword can correspond to any input symbol. In the alphabetic version, the alphabetic order of inputs and outputs must be identical. Thus, for example, A = {a, b, c} could not be assigned code H(A, C) = {00, 1, 01}, but instead should be assigned either H(A, C) = {00, 01, 1} or H(A, C) = {0, 10, 11}. This is also known as the Hu–Tucker problem, after T. C. Hu and Alan Tucker, the authors of the paper presenting the first O(n log n)-time solution to this optimal binary alphabetic problem,[12] which has some similarities to the Huffman algorithm but is not a variation of it. A later method, the Garsia–Wachs algorithm of Adriano Garsia and Michelle L. Wachs (1977), uses simpler logic to perform the same comparisons in the same total time bound. These optimal alphabetic binary trees are often used as binary search trees.[13]

The canonical Huffman code


If weights corresponding to the alphabetically ordered inputs are in numerical order, the Huffman code has the same lengths as the optimal alphabetic code, which can be found by calculating these lengths, rendering Hu–Tucker coding unnecessary. The code resulting from numerically (re-)ordered input is sometimes called the canonical Huffman code and is often the code used in practice, due to ease of encoding/decoding. The technique for finding this code is sometimes called Huffman–Shannon–Fano coding, since it is optimal like Huffman coding, but alphabetic in weight probability, like Shannon–Fano coding. The Huffman–Shannon–Fano code corresponding to the example is {000, 001, 01, 10, 11}, which, having the same codeword lengths as the original solution, is also optimal. But in canonical Huffman code, the result is {110, 111, 00, 01, 10}.
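
A minimal sketch of deriving a canonical code from a set of codeword lengths (shorter codes first, ties broken by symbol order); the function name is illustrative, and the lengths are those of the worked example above:

```python
def canonical_codes(lengths):
    """lengths: dict mapping symbol -> codeword length; returns dict mapping symbol -> bit string."""
    codes, code, prev_len = {}, 0, 0
    # Canonical order: by length, then by symbol.
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)   # widen the running code value when the length increases
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

print(canonical_codes({"a": 3, "b": 3, "c": 2, "d": 2, "e": 2}))
# -> {'c': '00', 'd': '01', 'e': '10', 'a': '110', 'b': '111'}, matching the canonical result above
```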

Applications


Arithmetic coding and Huffman coding produce equivalent results – achieving entropy – when every symbol has a probability of the form 1/2^k. In other circumstances, arithmetic coding can offer better compression than Huffman coding because – intuitively – its "code words" can have effectively non-integer bit lengths, whereas code words in prefix codes such as Huffman codes can only have an integer number of bits. Therefore, a code word of length k only optimally matches a symbol of probability 1/2^k and other probabilities are not represented optimally; whereas the code word length in arithmetic coding can be made to exactly match the true probability of the symbol. This difference is especially striking for small alphabet sizes.[citation needed]
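
As an illustrative example, consider three symbols each with probability 1/3. A Huffman code assigns codeword lengths {1, 2, 2}, for an average of 5/3 ≈ 1.667 bits/symbol, while the entropy is log2 3 ≈ 1.585 bits/symbol; an arithmetic coder can approach the latter figure because it is not restricted to whole-bit codewords.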

Prefix codes nevertheless remain in wide use because of their simplicity, high speed, and lack of patent coverage. They are often used as a "back-end" to other compression methods. Deflate (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end model and quantization followed by the use of prefix codes; these are often called "Huffman codes" even though most applications use pre-defined variable-length codes rather than codes designed using Huffman's algorithm.

References

  1. ^ a b Huffman, D. (1952). "A Method for the Construction of Minimum-Redundancy Codes" (PDF). Proceedings of the IRE. 40 (9): 1098–1101. doi:10.1109/JRPROC.1952.273898.
  2. ^ Van Leeuwen, Jan (1976). "On the construction of Huffman trees" (PDF). ICALP: 382–410.
  3. ^ Ze-Nian Li; Mark S. Drew; Jiangchuan Liu (2014). Fundamentals of Multimedia. Springer Science & Business Media. ISBN 978-3-319-05290-8.
  4. ^ J. Duda, K. Tahboub, N. J. Gadgil, E. J. Delp, The use of asymmetric numeral systems as an accurate replacement for Huffman coding, Picture Coding Symposium, 2015.
  5. ^ Huffman, Ken (1991). "Profile: David A. Huffman: Encoding the "Neatness" of Ones and Zeroes". Scientific American: 54–58.
  6. ^ Kleinberg, Jon; Tardos, Eva (2005). Algorithm Design (1st ed.). Pearson Education. p. 165. ISBN 9780321295354.
  7. ^ Gribov, Alexander. "Optimal Compression of a Polyline with Segments and Arcs". arXiv:1604.07476 [cs.CG].
  8. ^ Gallager, R.G.; van Voorhis, D.C. (1975). "Optimal source codes for geometrically distributed integer alphabets". IEEE Transactions on Information Theory. 21 (2): 228–230. doi:10.1109/TIT.1975.1055357.
  9. ^ Abrahams, J. (1997). "Code and parse trees for lossless source encoding". Written at Arlington, VA, USA. Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171). Division of Mathematics, Computer & Information Sciences, Office of Naval Research (ONR). Salerno: IEEE. pp. 145–171. CiteSeerX 10.1.1.589.4726. doi:10.1109/SEQUEN.1997.666911. ISBN 0-8186-8132-2. S2CID 124587565.
  10. ^ Karp, Richard M. (1961). "Minimum-redundancy coding for the discrete noiseless channel". IRE Transactions on Information Theory. 7 (1). IEEE: 27–38. doi:10.1109/TIT.1961.1057615.
  11. ^ Golin, Mordecai J. (January 1998). "A Dynamic Programming Algorithm for Constructing Optimal Prefix-Free Codes with Unequal Letter Costs" (PDF). IEEE Transactions on Information Theory. 44 (5): 1770–1781. doi:10.1109/18.705558. S2CID 2265146.
  12. ^ Hu, T. C.; Tucker, A. C. (1971). "Optimal Computer Search Trees and Variable-Length Alphabetical Codes". SIAM Journal on Applied Mathematics. 21 (4): 514. doi:10.1137/0121057. JSTOR 2099603.
  13. ^ Knuth, Donald E. (1998), "Algorithm G (Garsia–Wachs algorithm for optimum binary trees)", The Art of Computer Programming, Vol. 3: Sorting and Searching (2nd ed.), Addison–Wesley, pp. 451–453. See also History and bibliography, pp. 453–454.
