BW変換（Burrows Wheeler Transform）を書いた

「高速文字列解析の世界」を読んでて，BW変換のところがよくわからなかったので，実際に書いてみました．

高速文字列解析の世界――データ圧縮・全文検索・テキストマイニング (確率と情報の科学)

作者: 岡野原大輔
出版社/メーカー: 岩波書店
発売日: 2012/12/27
メディア: 単行本
購入: 15人クリック: 324回
この商品を含むブログ (5件) を見る

概要

BW変換ってなんぞやを一言でいうと，「変換すると，似たような記号がたくさん並ぶようになる，文字列の可逆変換」です．具体例をみせると，以下はWikipediaのBW変換の最初のパラグラフを，実際にBW変換してみた結果です．

元の文章*1

Compression techniques work by finding repeated patterns in the data and encoding the duplications more compactly. The Burrows Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding. More importantly, the transformation is reversible, without needing to store any additional data. The BWT is thus a free method of improving the efficiency of text compression algorithms, costing only some extra computation.

変換後の文章*2

mee....ssssyn,amehodksoadreotgra,eleeedheylatefsgnTseesstdssygd) fgsdsof,yorasyntfgngg, ssygrtes,s. nTsnyeonkegaysn ( $ WW BB t r trrrp n c rrrrt lhhhe hehddeemtcp i inu eeionn aaaaneenoen anoeodn hhrrhhbhmrcehlvr ppttltternh se itlrrtltttvguurrrrrmrt ooooef sss -ennnnnnnnnnlnctccc tTTttttWTtcctttslfcms sftrtdrdvdd stssstttnnh rwdrcauibl-eapabanttrrro i oioooooiohoooioouooieeeaaie iiiiiiiiiaehhorouuaaaoaiatsttttlhcc scccCccciiiiiii irftmMgwfffpschmrrm eeummmmmmmmiiaoeetaaartttooo f ppppp ttoooooefpraueeeoo weusrninanniaeidiemrunnnr ssssl -eeeeeo axauinaaru aa ctcc g iei rsaaianc n -sx ssasqqfrrrdBh opoeo oeenbclbsll

ということで，変換すると同じような記号がダーっと並んでるのが見て取れると思います．これの何が嬉しいかというと，同じような文字が連続していると，文字列を圧縮するときに圧縮効率が良くなります*3．

実装

ということで，BWTをするのと，元に戻すのを実際に実装してみました．とはいえ理解のための実装なので，パフォーマンスについては度外視してます．あと，終端文字として$を使っているので，文章中に$があるとうまく変換できません．

これを実際に実行してみると，以下のように可逆に変換されました．

python bwt.py abracadabra
ard$rcaaaabb
abracadabra

本の簡潔データ構造とかちゃんと読んでから，もう一回パフォーマンスチューニングしますかね...

参考

詳しい説明やパフォーマンス改善については，以下の2つの記事を読んでもらえるとよいと思います．特にLFマッピングの部分は，コードだけだと意味がわからない筈...

*1:実はハイフンがうまく処理できなかったので，「Burrows-Wheeler transform」を「Burrows Wheeler transform」にこっそりと変えてあります．

*2:blogのフォーマットにキレイに収まるように，半角スペースを3個くらい付け加えてあります．

*3:wikipediaにあるように実際にbzip2の圧縮アルゴリズムの中で使われています．

About connecting the dots.

data science related trivial things

BW変換（Burrows Wheeler Transform）を書いた

概要

元の文章*1

変換後の文章*2

実装

参考