NCBIXML | Parsing XML-format BLAST results with Python

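To parse BLAST search results in Python, use the NCBIXML module from Biopython. BLAST can write its results in several formats, including plain text, tab-delimited text, and XML. The plain-text layout differs between BLAST versions, which makes it a poor target for parsing. Tab-delimited output is also awkward to parse, because the columns can appear in many different orders. For these reasons, XML is the recommended format.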

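When BLAST results are parsed, the data are stored in a nested class structure: each record holds a list of alignments (hits), and each alignment holds a list of HSPs (high-scoring segment pairs). (Reference)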
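The nesting is easier to see on a concrete file. The following is a minimal sketch that walks the same three levels using the standard library's xml.etree.ElementTree instead of NCBIXML. The XML fragment is a hand-written, heavily trimmed stand-in for real BLAST XML output (real files contain many more elements), its values are invented for illustration, and the function name walk_blast_xml is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed stand-in for BLAST XML output.
# The values are invented for illustration only.
SAMPLE_XML = """<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>subject1</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>18.3</Hsp_bit-score>
              <Hsp_identity>8</Hsp_identity>
              <Hsp_qseq>ACGT-ACGT</Hsp_qseq>
              <Hsp_hseq>ACGTTACGT</Hsp_hseq>
              <Hsp_midline>|||| ||||</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def walk_blast_xml(xml_text):
    """Return one (query, hit, identities) tuple per HSP."""
    rows = []
    root = ET.fromstring(xml_text)
    # Iteration: one per query; corresponds to a record in NCBIXML
    for iteration in root.iter('Iteration'):
        query = iteration.findtext('Iteration_query-def')
        # Hit: corresponds to an alignment
        for hit in iteration.iter('Hit'):
            title = hit.findtext('Hit_def')
            # Hsp: corresponds to an HSP
            for hsp in hit.iter('Hsp'):
                identities = int(hsp.findtext('Hsp_identity'))
                rows.append((query, title, identities))
    return rows


for row in walk_blast_xml(SAMPLE_XML):
    print(row)
```

In practice NCBIXML does this traversal for you and exposes the values as attributes, so the sketch is only meant to show how the levels nest.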

The following Python script parses BLAST output in XML format.

from Bio.Blast import NCBIXML

# Path to a BLAST result saved in XML format
path = './ncbi-blast-result.xml'

with open(path, mode='r', encoding='utf-8') as fh:
    # NCBIXML.parse yields one record per query sequence
    blast_records = NCBIXML.parse(fh)
    for blast_record in blast_records:
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                # Header: title|raw score|bit score|number of identities
                he = '>' + alignment.title + '|'
                he += str(hsp.score) + '|'
                he += str(hsp.bits) + '|'
                he += str(hsp.identities)
                print(he)
                # Show the first 80 columns of the alignment
                print(hsp.query[:80])
                print(hsp.match[:80])
                print(hsp.sbjct[:80])
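
The script prints only the leading part of each alignment, up to column 80. To show a long HSP in full, the three rows can be wrapped into blocks. The following is a minimal sketch, assuming the query, match, and subject strings have equal length, as they do for a single HSP. The helper name format_alignment is hypothetical and is not part of Biopython, and the strings in the usage line are made-up stand-ins for hsp.query, hsp.match, and hsp.sbjct.

```python
def format_alignment(query, match, sbjct, width=80):
    """Return the three alignment rows as text, wrapped every `width` columns."""
    if not (len(query) == len(match) == len(sbjct)):
        raise ValueError('alignment rows must have equal length')
    blocks = []
    for start in range(0, len(query), width):
        end = start + width
        # Keep the three rows of one block together so columns stay aligned
        rows = [query[start:end], match[start:end], sbjct[start:end]]
        blocks.append('\n'.join(rows))
    # Separate consecutive blocks with a blank line
    return '\n\n'.join(blocks)


# Made-up stand-ins for hsp.query, hsp.match and hsp.sbjct
print(format_alignment('ACGT-ACGT', '|||| ||||', 'ACGTTACGT', width=5))
```

Inside the loop, the call would be print(format_alignment(hsp.query, hsp.match, hsp.sbjct)).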