BeautifulSoup is wonderful, and there are not many sites explaining it,
so I have listed the features that look easy to use.

Note that this assumes the default encoding has already been configured.

HTML passed to BeautifulSoup has its character encoding converted to utf-8.
Calling prettify() regenerates it as cleanly formatted source.
Note: anything processed inside BeautifulSoup works from this converted source.

It does not seem to go as far as repairing broken tags,
but it does rebuild the line breaks and indentation.
It looks quite usable in web applications as well.

# -*- coding: utf-8 -*-

import re, urllib, urllib2

from BeautifulSoup import BeautifulSoup

url = 'http://www.google.co.jp/search?hl=ja&num=100&q='
# url = 'http://search.yahoo.co.jp/search?p=' # for Yahoo

# Create the client
opener = urllib2.build_opener()

# Python's default UA seems to be rejected by Google, so change it
opener.addheaders = [('User-Agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')]

q = 'Perl使いのPythonちゃん' # search query
q = urllib.quote(q) # URL-encode the query

# Fetch the Google search results
print url+q
html = opener.open(url+q).read()

# Pass the HTML source to BeautifulSoup
soup = BeautifulSoup(html)

# Get the automatically detected encoding of the original page
enc = soup.originalEncoding
print "Encoding : %s" % enc

# Print the source
# Note: unicode(text, 'utf-8') converts utf-8 text to unicode.
#   prettify() output is utf-8, so decode it as utf-8 rather than as enc.
# print unicode(soup.prettify(), 'utf-8')

# Note: "UnicodeEncodeError: 'cp932' codec can't encode character..."
#   may be caused by Windows extended characters.
#   In that case, use the 'mbcs' codec.
#print unicode(soup.prettify(), 'utf-8').encode('mbcs')

### Accessing tags - 1 (pseudo-DOM style)

print u'Pseudo-DOM style'

# contents[0] accesses the first element at that level
# .name returns the element's tag name (lowercase)
print soup.contents[0].name
# The result is u'html'

# Go one level further down to a child element (result: u'head')
print soup.contents[0].contents[0].name

# Store the element in a variable
head = soup.contents[0].contents[0]

# Access the parent element (result: u'html')
print head.parent.name

# Access the next element in document order, descending into children (result: u'meta')
print head.next.name

# Access the next element at the same level (result: u'body')
print head.nextSibling.name

# Combined (result: u'table')
print head.nextSibling.contents[0].name

# Combined (result: u'blockquote')
print head.nextSibling.contents[0].nextSibling.name

### Accessing tags - 2 (a more DOM-like style)

print u'A more DOM-like style'

# Access by tag name
titleTag = soup.html.head.title

# Printing it as-is shows the whole tag
print u"%s" % titleTag

# .string gives the inner text
print u"%s" % titleTag.string

# Count the matching elements
print len(soup('font'))

# Return the first matching tag (can also be narrowed by attribute)
print soup.find('a')

# Return the matching tags as a list (can also be narrowed by attribute)
print soup.findAll('a', href="/")

# Even regular expressions work
print soup.find('a', href=re.compile('^/'))

# The result can also be indexed as a list
print soup('a', href=re.compile('^/'))[1]

# Return an attribute's value
print soup('a', href=re.compile('^/'))[0]['href']

# Note to self: use BeautifulStoneSoup(xml) when parsing XML? (unverified)
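
To make it clearer what a filter like href=re.compile('^/') selects, here is a rough stand-alone sketch. It does not use BeautifulSoup at all: it approximates the same idea with Python 3's standard html.parser, and the names LinkCollector and find_hrefs, as well as the sample HTML, are made up for illustration.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose href matches a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        # Skip anchors without an href, keep those the regex matches
        if href is not None and self.pattern.search(href):
            self.hrefs.append(href)


def find_hrefs(html, pattern):
    """Hypothetical helper: return matching hrefs in document order."""
    collector = LinkCollector(pattern)
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Made-up sample markup, not an actual search result page
sample = (
    '<html><body>'
    '<a href="/">top</a>'
    '<a href="/intl/ja/about.html">about</a>'
    '<a href="http://example.com/">external</a>'
    '<a name="x">no href</a>'
    '</body></html>'
)

root_relative = find_hrefs(sample, re.compile('^/'))
print(root_relative)
```

The pattern '^/' keeps only links whose href starts with a slash, so the absolute http:// link and the anchor without an href are dropped.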

It also seems possible to replace and create tags in a DOM-like way,
but I do not need that right now, so I will leave it for another time.

Right, I'm bored now!
Next I'll try putting these together into a sample.
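
As a possible starting point for such a sample, the request-building steps from the script above (URL-encoding the query and replacing the User-Agent) might look like this on Python 3, where urllib2 no longer exists. This is only a sketch using the standard urllib.parse and urllib.request modules. The helper name build_search_request is an assumption of mine, and the request is built but never sent.

```python
from urllib.parse import quote, unquote
from urllib.request import Request

BASE_URL = 'http://www.google.co.jp/search?hl=ja&num=100&q='
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'


def build_search_request(query):
    """Hypothetical helper: build a Request with a browser-like User-Agent."""
    # quote() percent-encodes the UTF-8 bytes of non-ASCII characters
    encoded = quote(query)
    return Request(BASE_URL + encoded, headers={'User-Agent': USER_AGENT})


request = build_search_request('Perl使いのPythonちゃん')

# Inspect the request without touching the network
print(request.full_url)
# Request stores header names capitalized, hence 'User-agent'
print(request.get_header('User-agent'))
```

Passing the request to urllib.request.urlopen() would perform the actual fetch. That step is left out here so the sketch runs offline.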
