November 14, 2008: Start
http://www.python.org/dev/peps/pep-0305/
Abstract

The Comma Separated Values (CSV) file format is the most common import and export format for spreadsheets and databases. Although many CSV files are simple to parse, the format is not formally defined by a stable specification and is subtle enough that parsing lines of a CSV file with something like line.split(",") is eventually bound to fail. This PEP defines an API for reading and writing CSV files. It is accompanied by a corresponding module which implements the API.

To Do (Notes for the Interested and Ambitious)
Application Domain

This PEP is about doing one thing well: parsing tabular data which may use a variety of field separators, quoting characters, quote escape mechanisms and line endings. The authors intend the proposed module to solve this one parsing problem efficiently. The authors do not intend to address any of these related topics:
Rationale

Often, CSV files are formatted simply enough that you can get by reading them line-by-line and splitting on the commas which delimit the fields. This is especially true if all the data being read is numeric. This approach may work for a while, then come back to bite you in the butt when somebody puts something unexpected in the data like a comma. As you dig into the problem you may eventually come to the conclusion that you can solve the problem using regular expressions. This will work for a while, then break mysteriously one day. The problem grows, so you dig deeper and eventually realize that you need a purpose-built parser for the format.

CSV formats are not well-defined and different implementations have a number of subtle corner cases. It has been suggested that the "V" in the acronym stands for "Vague" instead of "Values". Different delimiters and quoting characters are just the start. Some programs generate whitespace after each delimiter which is not part of the following field. Others quote embedded quoting characters by doubling them, others by prefixing them with an escape character. The list of weird ways to do things can seem endless.

All this variability means it is difficult for programmers to reliably parse CSV files from many sources or generate CSV files designed to be fed to specific external programs without a thorough understanding of those sources and programs. This PEP and the software which accompanies it attempt to make the process less fragile.

Existing Modules

This problem has been tackled before. At least three modules currently available in the Python community enable programmers to read and write CSV files:

Each has a different API, making it somewhat difficult for programmers to switch between them.
More of a problem may be that they interpret some of the CSV corner cases differently, so even after surmounting the differences between the module APIs, the programmer also has to deal with semantic differences between the packages.

Module Interface

This PEP supports three basic APIs: one to read and parse CSV files, one to write them, and one to identify different CSV dialects to the readers and writers.

Reading CSV Files

CSV readers are created with the reader factory function:

    obj = reader(iterable [, dialect='excel'] [optional keyword args])

A reader object is an iterator which takes an iterable object returning lines as the sole required parameter. If it supports a binary mode (file objects do), the iterable argument to the reader function must have been opened in binary mode. This gives the reader object full control over the interpretation of the file's contents. The optional dialect parameter is discussed below. The reader function also accepts several optional keyword arguments which define specific format settings for the parser (see the section "Formatting Parameters"). Readers are typically used as follows:

    csvreader = csv.reader(file("some.csv"))
    for row in csvreader:
        process(row)

Each row returned by a reader object is a list of strings or Unicode objects.

When both a dialect parameter and individual formatting parameters are passed to the constructor, first the dialect is queried for formatting parameters, then individual formatting parameters are examined.

Writing CSV Files

Creating writers is similar:

    obj = writer(fileobj [, dialect='excel'], [optional keyword args])

A writer object is a wrapper around a file-like object opened for writing in binary mode (if such a distinction is made). It accepts the same optional keyword parameters as the reader constructor.
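As a rough illustration of the reader and writer factories described above, here is a minimal round-trip sketch. It assumes the csv module as shipped in the Python 3 standard library rather than the sample implementation this PEP refers to, and it uses io.StringIO in place of a real file. The sample rows and variable names are invented for the example.

```python
import csv
import io

# Hypothetical sample rows containing a comma and embedded quote characters.
rows = [
    ["name", "comment"],
    ["Smith", "likes commas, apparently"],
    ["Jones", 'said "hello"'],
]

# Write the rows with the default "excel" dialect.
buf = io.StringIO(newline="")
writer = csv.writer(buf, dialect="excel")
for row in rows:
    writer.writerow(row)
text = buf.getvalue()

# Naive splitting cuts the quoted field at its embedded comma.
naive = text.splitlines()[1].split(",")

# The reader honours quoting and recovers the original rows.
parsed = list(csv.reader(io.StringIO(text, newline=""), dialect="excel"))

# The dialect is consulted first, then individual keyword parameters override it.
override_buf = io.StringIO(newline="")
csv.writer(override_buf, dialect="excel", delimiter=":", quotechar="'").writerow(["a:b", "c"])
overridden = override_buf.getvalue()

print(naive)
print(parsed == rows)
print(repr(overridden))
```

In this sketch the naive split yields three pieces for a two-field row, while the reader returns the rows exactly as they were written. The overridden writer still takes its line terminator and quoting policy from the "excel" dialect.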
Writers are typically used as follows:

    csvwriter = csv.writer(file("some.csv", "w"))
    for row in someiterable:
        csvwriter.writerow(row)

To generate a set of field names as the first row of the CSV file, the programmer must explicitly write it, e.g.:

    csvwriter = csv.writer(file("some.csv", "w"))
    csvwriter.writerow(names)
    for row in someiterable:
        csvwriter.writerow(row)

or arrange for it to be the first row in the iterable being written.

Managing Different Dialects

Because CSV is a somewhat ill-defined format, there are plenty of ways one CSV file can differ from another, yet contain exactly the same data. Many tools which can import or export tabular data allow the user to indicate the field delimiter, quote character, line terminator, and other characteristics of the file. These can be fairly easily determined, but are still mildly annoying to figure out, and make for fairly long function calls when specified individually.

To try to minimize the difficulty of figuring out and specifying a bunch of formatting parameters, reader and writer objects support a dialect argument which is just a convenient handle on a group of these lower level parameters. When a dialect is given as a string it identifies one of the dialects known to the module via its registration functions, otherwise it must be an instance of the Dialect class as described below.

Dialects will generally be named after applications or organizations which define specific sets of format constraints. Two dialects are defined in the module as of this writing: "excel", which describes the default format constraints for CSV file export by Excel 97 and Excel 2000, and "excel-tab", which is the same as "excel" but specifies an ASCII TAB character as the field delimiter.

Dialects are implemented as attribute-only classes to enable users to construct variant dialects by subclassing.
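The subclassing and registration idea can be sketched as follows. This assumes the csv module in the Python 3 standard library, which accepts a registered name or a Dialect subclass for the dialect argument. The SemicolonDialect class, the "semicolon" name, and the sample data are hypothetical and exist only for this example.

```python
import csv
import io


class SemicolonDialect(csv.excel):
    """Hypothetical variant: identical to "excel" except for the delimiter."""
    delimiter = ";"


# Associate a string name with the dialect so it can be referenced by name.
csv.register_dialect("semicolon", SemicolonDialect)

data = 'a;b;"c;d"\r\n'

# The dialect can be passed by its registered name or as the class itself.
by_name = next(csv.reader(io.StringIO(data, newline=""), dialect="semicolon"))
by_class = next(csv.reader(io.StringIO(data, newline=""), dialect=SemicolonDialect))

registered = "semicolon" in csv.list_dialects()

# Remove the name again so the example leaves no global state behind.
csv.unregister_dialect("semicolon")

print(by_name)
print(by_class == by_name)
```

Only the delimiter is overridden here. The quote character, quoting policy, and line terminator are all inherited from the "excel" dialect, which is the point of defining variants by subclassing.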
The "excel" dialect is a subclass of Dialect and is defined as follows:

    class Dialect:
        # placeholders
        delimiter = None
        quotechar = None
        escapechar = None
        doublequote = None
        skipinitialspace = None
        lineterminator = None
        quoting = None

    class excel(Dialect):
        delimiter = ','
        quotechar = '"'
        doublequote = True
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = QUOTE_MINIMAL

The "excel-tab" dialect is defined as:

    class exceltsv(excel):
        delimiter = '\t'

(For a description of the individual formatting parameters see the section "Formatting Parameters".)

To enable string references to specific dialects, the module defines several functions:

    dialect = get_dialect(name)
    names = list_dialects()
    register_dialect(name, dialect)
    unregister_dialect(name)

get_dialect() returns the dialect instance associated with the given name. list_dialects() returns a list of all registered dialect names. register_dialect() associates a string name with a dialect class. unregister_dialect() deletes a name/dialect association.

Formatting Parameters

In addition to the dialect argument, both the reader and writer constructors take several specific formatting parameters, specified as keyword parameters. The formatting parameters understood are:
When processing a dialect setting and one or more of the other optional parameters, the dialect parameter is processed before the individual formatting parameters. This makes it easy to choose a dialect, then override one or more of the settings without defining a new dialect class. For example, if a CSV file was generated by Excel 2000 using single quotes as the quote character and a colon as the delimiter, you could create a reader like:

    csvreader = csv.reader(file("some.csv"), dialect="excel",
                           quotechar="'", delimiter=':')

Other details of how Excel generates CSV files would be handled automatically because of the reference to the "excel" dialect.

Reader Objects

Reader objects are iterables whose next() method returns a sequence of strings, one string per field in the row.

Writer Objects

Writer objects have two methods, writerow() and writerows(). The former accepts an iterable (typically a list) of fields which are to be written to the output. The latter accepts a list of iterables and calls writerow() for each.

Implementation

There is a sample implementation available. [1] The goal is for it to efficiently implement the API described in the PEP. It is heavily based on the Object Craft csv module. [2]

Issues
References
There are many references to other CSV-related projects on the Web. A few are included here.

Copyright

This document has been placed in the public domain.