Model for Tabular Data and Metadata on the Web

W3C Recommendation

This version:
http://www.w3.org/TR/2015/REC-tabular-data-model-20151217/
Latest published version:
http://www.w3.org/TR/tabular-data-model/
Latest editor's draft:
http://w3c.github.io/csvw/syntax/
Test suite:
http://www.w3.org/2013/csvw/tests/
Implementation report:
http://www.w3.org/2013/csvw/implementation_report.html
Previous version:
http://www.w3.org/TR/2015/PR-tabular-data-model-20151117/
Editors:
Jeni Tennison, Open Data Institute
Gregg Kellogg, Kellogg Associates
Authors:
Jeni Tennison, Open Data Institute
Gregg Kellogg, Kellogg Associates
Ivan Herman, W3C
Repository:
We are on GitHub
File a bug
Changes:
Diff to previous version
Commit history

Please check the errata for any errors or issues reported since publication.

This document is also available in this non-normative format: ePub

The English version of this specification is the only normative version. Non-normative translations may also be available.


Abstract

Tabular data is routinely transferred on the web in a variety of formats, including variants on CSV, tab-delimited files, fixed field formats, spreadsheets, HTML tables, and SQL dumps. This document outlines a data model, or infoset, for tabular data and metadata about that tabular data that can be used as a basis for validation, display, or creating other formats. It also contains some non-normative guidance for publishing tabular data as CSV and how that maps into the tabular data model.

An annotated model of tabular data can be supplemented by separate metadata about the table. This specification defines how implementations should locate that metadata, given a file containing tabular data. The standard syntax for that metadata is defined in [tabular-metadata]. Note, however, that applications may have other means to create annotated tables, e.g., through application-specific APIs; this model does not depend on the specificities described in [tabular-metadata].

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

The CSV on the Web Working Group was chartered to produce a recommendation "Access methods for CSV Metadata" as well as recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various formats (e.g., RDF, JSON, or XML)". This document aims to primarily satisfy the "Access methods for CSV Metadata" recommendation (see section 5. Locating Metadata), though it also specifies an underlying model for tabular data and is therefore a basis for the other chartered Recommendations.

The definition of CSV used in this document is based on IETF's [RFC4180], which is an Informational RFC. The working group's expectation is that future suggestions to refine RFC 4180 will be relayed to the IETF (e.g. around encoding and line endings) and contribute to its discussions about moving CSV to the Standards track.

Many files containing tabular data embed metadata, for example in lines before the header row of an otherwise standard CSV document. This specification does not define any formats for embedding metadata within CSV files, aside from the titles of columns in the header row, which are defined in CSV. We would encourage groups that define tabular data formats to also define a mapping into the annotated tabular data model defined in this document.

This document was published by the CSV on the Web Working Group as a Recommendation. If you wish to make comments regarding this document, please send them to public-csv-wg@w3.org (subscribe, archives). All comments are welcome.

Please see the Working Group's implementation report.

This document has been reviewed by W3C Members, by software developers, and by other W3C groups and interested parties, and is endorsed by the Director as a W3C Recommendation. It is a stable document and may be used as reference material or cited from another document. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 September 2015 W3C Process Document.

1. Introduction

Tabular data is data that is structured into rows, each of which contains information about some thing. Each row contains the same number of cells (although some of these cells may be empty), which provide values of properties of the thing described by the row. In tabular data, cells within the same column provide values for the same property of the things described by each row. This is what differentiates tabular data from other line-oriented formats.

Tabular data is routinely transferred on the web in a textual format called CSV, but the definition of CSV in practice is very loose. Some people use the term to mean any delimited text file. Others stick more closely to the most standard definition of CSV that there is, [RFC4180]. Appendix A describes the various ways in which CSV is defined. This specification refers to such files, as well as tab-delimited files, fixed field formats, spreadsheets, HTML tables, and SQL dumps as tabular data files.

In section 4. Tabular Data Models, this document defines a model for tabular data that abstracts away from the varying syntaxes that are used when exchanging tabular data. The model includes annotations, or metadata, about collections of individual tables, rows, columns, and cells. These annotations are typically supplied through separate metadata files; section 5. Locating Metadata defines how these metadata files can be located, while [tabular-metadata] defines what they contain.

Once an annotated table has been created, it can be processed in various ways, such as display, validation, or conversion into other formats. This processing is described in section 6. Processing Tables.

This specification does not normatively define a format for exchanging tabular data. However, it does provide some best practice guidelines for publishing tabular data as CSV, in section 7. Best Practice CSV, and for parsing both this syntax and those similar to it, in section 8. Parsing Tabular Data.

2. Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MAY, MUST, MUST NOT, SHOULD, and SHOULD NOT are to be interpreted as described in [RFC2119].

This specification makes use of the compact IRI Syntax; please refer to the Compact IRIs from [JSON-LD].

This specification makes use of the following namespaces:

csvw:
http://www.w3.org/ns/csvw#
dc:
http://purl.org/dc/terms/
rdf:
http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:
http://www.w3.org/2000/01/rdf-schema#
schema:
http://schema.org/
xsd:
http://www.w3.org/2001/XMLSchema#

3. Typographical conventions

The following typographic conventions are used in this specification:

markup
Markup (elements, attributes, properties), machine processable values (string, characters, media types), property name, or a file name is in red-orange monospace font.
variable
A variable in pseudo-code or in an algorithm description is in italics.
definition
A definition of a term, to be used elsewhere in this or other specifications, is in bold and italics.
definition reference
A reference to a definition in this document is underlined and is also an active link to the definition itself.
markup definition reference
A reference to a definition in this document, when the reference itself is also a markup, is underlined, red-orange monospace font, and is also an active link to the definition itself.
external definition reference
A reference to a definition in another document is underlined, in italics, and is also an active link to the definition itself.
markup external definition reference
A reference to a definition in another document, when the reference itself is also a markup, is underlined, in italics red-orange monospace font, and is also an active link to the definition itself.
hyperlink
A hyperlink is underlined and in blue.
[reference]
A document reference (normative or informative) is enclosed in square brackets and links to the references section.
Note

Notes are in light green boxes with a green left border and with a "Note" header in green. Notes are normative or informative depending on whether they are in a normative or informative section, respectively.

Example 1
Examples are in light khaki boxes, with khaki left border, and with a
numbered "Example" header in khaki. Examples are always informative.
The content of the example is in monospace font and may be syntax colored.

4. Tabular Data Models

This section defines an annotated tabular data model: a model for tables that are annotated with metadata. Annotations provide information about the cells, rows, columns, tables, and groups of tables with which they are associated. The values of these annotations may be lists, structured objects, or atomic values. Core annotations are those that affect the behavior of processors defined in this specification, but other annotations may also be present on any of the components of the model.

Annotations may be described directly in [tabular-metadata], be embedded in a tabular data file, or created during the process of generating an annotated table.

String values within the tabular data model (such as column titles or cell string values) MUST contain only Unicode characters.

Note

In this document, the term annotation refers to any metadata associated with an object in the annotated tabular data model. These are not necessarily web annotations in the sense of [annotation-model].

4.1 Table groups

A group of tables comprises a set of annotated tables and a set of annotations that relate to that group of tables. The core annotations of a group of tables are:

Groups of tables MAY in addition have any number of annotations which provide information about the group of tables. Annotations on a group of tables may include:

When originating from [tabular-metadata], these annotations arise from common properties defined on table group descriptions within metadata documents.

4.2 Tables

An annotated table is a table that is annotated with additional metadata. The core annotations of a table are:

The table MAY in addition have any number of other annotations. Annotations on a table may include:

When originating from [tabular-metadata], these annotations arise from common properties defined on table descriptions within metadata documents.

4.3 Columns

A column represents a vertical arrangement of cells within a table. The core annotations of a column are:

Note

Several of these annotations arise from inherited properties that may be defined within metadata on table group, table or individual column descriptions.

Columns MAY in addition have any number of other annotations, such as a description. When originating from [tabular-metadata], these annotations arise from common properties defined on column descriptions within metadata documents.

4.4 Rows

A row represents a horizontal arrangement of cells within a table. The core annotations of a row are:

Rows MAY have any number of additional annotations. The annotations on a row provide additional metadata about the information held in the row, such as:

Neither this specification nor [tabular-metadata] defines a method to specify such annotations. Implementations MAY define a method for adding annotations to rows by interpreting notes on the table.

4.5 Cells

A cell represents a cell at the intersection of a row and a column within a table. The core annotations of a cell are:

Note

The presence or absence of quotes around a value within a CSV file is a syntactic detail that is not reflected in the tabular data model. In other words, there is no distinction in the model between the second value in a,,z and the second value in a,"",z.
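
The equivalence of the two forms can be observed with an ordinary CSV reader. The following informative sketch uses Python's standard csv module with its default dialect; that module is only one possible parser and is not referenced by this specification, and the helper name parse_line is hypothetical.

```python
import csv
import io


def parse_line(line):
    """Parse a single CSV line using Python's default (excel) dialect."""
    return next(csv.reader(io.StringIO(line)))


unquoted = parse_line('a,,z')
quoted = parse_line('a,"",z')

# Both lines yield the same three string values; the quotes leave no trace.
print(unquoted == quoted)
```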

Note

Several of these annotations arise from or are constructed based on inherited properties that may be defined within metadata on table group, table or column descriptions.

Cells MAY have any number of additional annotations. The annotations on a cell provide metadata about the value held in the cell, particularly when this overrides the information provided for the column and row that the cell falls within. Annotations on a cell might be:

Neither this specification nor [tabular-metadata] defines a method to specify such annotations. Implementations MAY define a method for adding annotations to cells by interpreting notes on the table.

Note

Units of measure are not a built-in part of the tabular data model. However, they can be captured through notes or included in the converted output of tabular data through defining datatypes with identifiers that indicate the unit of measure, using virtual columns to create nested data structures, or using common properties to specify Data Cube attributes as defined in [vocab-data-cube].

4.6 Datatypes

Columns and cell values within tables may be annotated with a datatype which indicates the type of the values obtained by parsing the string value of the cell.

Datatypes are based on a subset of those defined in [xmlschema11-2]. The annotated tabular data model limits cell values to have datatypes as shown on the diagram:

[Figure: Built-in Datatype Hierarchy diagram]
Fig. 1 Diagram showing the built-in datatypes, based on [xmlschema11-2]; names in parentheses denote aliases to the [xmlschema11-2] terms.

The core annotations of a datatype are:

If the id of a datatype is that of a built-in datatype, the values of the other core annotations listed above MUST be consistent with the values defined in [xmlschema11-2] or above. For example, if the id is xsd:integer then the base must be xsd:decimal.

Datatypes MAY have any number of additional annotations. The annotations on a datatype provide metadata about the datatype such as title or description. These arise from common properties defined on datatype descriptions within metadata documents, as defined in [tabular-metadata].

Note

The id annotation may reference an XSD, OWL or other datatype definition, which is not used by this specification for validating column values, but may be useful for further processing.

4.6.1 Length Constraints

The length, minimum length and maximum length annotations indicate the exact, minimum and maximum lengths for cell values.

The length of a value is determined as defined in [xmlschema11-2], namely as follows:

  • if the value is null, its length is zero.
  • if the value is a string or one of its subtypes, its length is the number of characters (ie [UNICODE] code points) in the value.
  • if the value is of a binary type, its length is the number of bytes in the binary value.

If the value is a list, the constraint applies to each element of the list.
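
The rules above can be sketched as follows. This is an informative illustration only: the function names are hypothetical, and it assumes that null is represented as None, strings as str (whose len counts code points in Python 3), and binary values as bytes.

```python
def value_length(value):
    """Length of a single cell value under the three rules listed above."""
    if value is None:
        # A null value has length zero.
        return 0
    if isinstance(value, (bytes, bytearray)):
        # Binary types: the number of bytes.
        return len(value)
    if isinstance(value, str):
        # Strings: the number of Unicode code points.
        return len(value)
    raise TypeError("length is not defined for this value type")


def satisfies_length(value, length=None, min_length=None, max_length=None):
    """Check the length constraints; a list is valid only if every element is."""
    if isinstance(value, list):
        return all(
            satisfies_length(item, length, min_length, max_length)
            for item in value
        )
    n = value_length(value)
    if length is not None and n != length:
        return False
    if min_length is not None and n < min_length:
        return False
    if max_length is not None and n > max_length:
        return False
    return True
```

For instance, satisfies_length(["ab", "c"], min_length=2) is false under this sketch because the second list element is too short.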

4.6.2 Value Constraints

The minimum, maximum, minimum exclusive, and maximum exclusive annotations indicate limits on cell values. These apply to numeric, date/time, and duration types.

Validation of cell values against these datatypes is as defined in [xmlschema11-2]. If the value is a list, the constraint applies to each element of the list.
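
As an informative sketch, the four limits might be checked as shown below. The function name is hypothetical, and the sketch assumes that cell values have already been parsed into Python values whose built-in ordering is acceptable; the normative comparison rules are those of [xmlschema11-2], which this sketch does not reproduce (for example, for partially ordered durations).

```python
from datetime import date


def satisfies_range(value, minimum=None, maximum=None,
                    min_exclusive=None, max_exclusive=None):
    """Check inclusive and exclusive limits; a list is valid only if every element is."""
    if isinstance(value, list):
        return all(
            satisfies_range(item, minimum, maximum, min_exclusive, max_exclusive)
            for item in value
        )
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    if min_exclusive is not None and value <= min_exclusive:
        return False
    if max_exclusive is not None and value >= max_exclusive:
        return False
    return True


# The second date is later than the maximum, so the whole list fails.
print(satisfies_range([date(2015, 1, 1), date(2016, 1, 1)],
                      maximum=date(2015, 12, 31)))
```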

5. Locating Metadata

As described in section 4. Tabular Data Models, tabular data may have a number of annotations associated with it. Here we describe the different methods that can be used to locate metadata that provides those annotations.

In the methods of locating metadata described here, metadata is provided within a single document. The syntax of such documents is defined in [tabular-metadata]. Metadata is located using a specific order of precedence:

  1. metadata supplied by the user of the implementation that is processing the tabular data, see section 5.1 Overriding Metadata.
  2. metadata in a document linked to using a Link header associated with the tabular data file, see section 5.2 Link Header.
  3. metadata located through default paths which may be overridden by a site-wide location configuration, see section 5.3 Default Locations and Site-wide Location Configuration.
  4. metadata embedded within the tabular data file itself, see section 5.4 Embedded Metadata.

Processors MUST use the first metadata found for processing a tabular data file by using overriding metadata, if provided. Otherwise processors MUST attempt to locate the first metadata document from the Link header or the metadata located through site-wide configuration. If no metadata is supplied or found, processors MUST use embedded metadata. If the metadata does not originate from the embedded metadata, validators MUST verify that the table group description within that metadata is compatible with that in the embedded metadata, as defined in [tabular-metadata].
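
The order of precedence can be sketched informally as below. The function and argument names are hypothetical; each argument stands for metadata already obtained from one source (or None if that source yielded nothing), and the compatibility check required of validators is not shown.

```python
def select_metadata(overriding=None, linked=None, site_wide=None, embedded=None):
    """Return (source, metadata) for the first source that supplied metadata."""
    for source, metadata in (
        ("overriding", overriding),   # 5.1 Overriding Metadata
        ("link-header", linked),      # 5.2 Link Header
        ("site-wide", site_wide),     # 5.3 default or site-wide locations
    ):
        if metadata is not None:
            return source, metadata
    # Nothing was supplied or found: fall back to embedded metadata (5.4).
    return "embedded", embedded
```

A real processor would typically evaluate the sources lazily, since finding metadata at an earlier step makes the later retrievals unnecessary.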

Note

When feasible, processors should start from a metadata file and publishers should link to metadata files directly, rather than depend on mechanisms outlined in this section for locating metadata from a tabular data file. Otherwise, if possible, publishers should provide a Link header on the tabular data file as described in section 5.2 Link Header.

Note

If there is no site-wide location configuration, section 5.3 Default Locations and Site-wide Location Configuration specifies default URI patterns or paths to be used to locate metadata.

5.1 Overriding Metadata

Processors SHOULD provide users with the facility to provide their own metadata for tabular data files that they process. This might be provided:

For example, a processor might be invoked with:

Example 2: Command-line CSV processing with column types
$ csvlint data.csv --datatypes:string,float,string,string

to enable the testing of the types of values in the columns of a CSV file, or with:

Example 3: Command-line CSV processing with a schema
$ csvlint data.csv --schema:schema.json

to supply a schema that describes the contents of the file, against which it can be validated.

Metadata supplied in this way is called overriding, or user-supplied, metadata. Implementations SHOULD define how any options they define are mapped into the vocabulary defined in [tabular-metadata]. If the user selects existing metadata files, implementations MUST NOT use metadata located through the Link header (as described in section 5.2 Link Header) or site-wide location configuration (as described in section 5.3 Default Locations and Site-wide Location Configuration).

Note

Users should ensure that any metadata from those locations that they wish to use is explicitly incorporated into the overriding metadata that they use to process tabular data. Processors may provide facilities to make this easier by automatically merging metadata files from different locations, but this specification does not define how such merging is carried out.

5.2 Link Header

If the user has not supplied a metadata file as overriding metadata, described in section 5.1 Overriding Metadata, then when retrieving a tabular data file via HTTP, processors MUST retrieve the metadata file referenced by any Link header with:

so long as this referenced metadata file describes the retrieved tabular data file (ie, contains a table description whose url matches the request URL).

If there is more than one valid metadata file linked to through multiple Link headers, then implementations MUST use the metadata file referenced by the last Link header.

For example, when the response to requesting a tab-separated file looks like:

Example 4: HTTP response including Link headers
HTTP/1.1 200 OK
Content-Type: text/tab-separated-values
...
Link: <metadata.json>; rel="describedBy"; type="application/csvm+json"

an implementation must use the referenced metadata.json to supply metadata for processing the file.

If the metadata file found at this location does not explicitly include a reference to the requested tabular data file then it MUST be ignored. URLs MUST be normalized as described in section 6.3 URL Normalization.
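
The "last Link header wins" rule can be sketched informally as follows. The sketch is a simplification under stated assumptions: it accepts only the rel and type values shown in Example 4, uses a small regular-expression parser rather than a complete Link header parser, and does not perform the required check that the retrieved metadata actually describes the tabular data file. The name linked_metadata_url is hypothetical.

```python
import re
from urllib.parse import urljoin

# One link-value: <target> followed by zero or more ;name=value parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]+))*)')
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def linked_metadata_url(link_headers, request_url):
    """Return the metadata URL named by the last matching Link header, or None."""
    chosen = None
    for header in link_headers:
        for target, raw_params in LINK_RE.findall(header):
            params = {
                name.lower(): (quoted or unquoted)
                for name, quoted, unquoted in PARAM_RE.findall(raw_params)
            }
            if (params.get("rel", "").lower() == "describedby"
                    and params.get("type") == "application/csvm+json"):
                # Resolve relative targets; later matches replace earlier ones.
                chosen = urljoin(request_url, target)
    return chosen
```

Given the response in Example 4 for a request to the hypothetical URL http://example.org/data/trees.tsv, this sketch returns http://example.org/data/metadata.json.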

Note

The Link header of the metadata file MAY include references to the CSV files it describes, using the describes relationship. For example, in the countries' metadata example, the server might return the following headers:

Link: <http://example.org/countries.csv>; rel="describes"; type="text/csv"
Link: <http://example.org/country_slice.csv>; rel="describes"; type="text/csv"

However, locating the metadata SHOULD NOT depend on this mechanism.

5.3 Default Locations and Site-wide Location Configuration

If the user has not supplied a metadata file as overriding metadata, described in section 5.1 Overriding Metadata, and no applicable metadata file has been discovered through a Link header, described in section 5.2 Link Header, processors MUST attempt to locate a metadata document through site-wide configuration.

In this case, processors MUST retrieve the file from the well-known URI /.well-known/csvm. (Well-known URIs are defined by [RFC5785].) If no such file is located (i.e. the response results in a client error 4xx status code or a server error 5xx status code), processors MUST proceed as if this file were found with the following content which defines default locations:

{+url}-metadata.json
csv-metadata.json

The response to retrieving /.well-known/csvm MAY be cached, subject to cache control directives. This includes caching an unsuccessful response such as a 404 Not Found.

This file MUST contain a URI template, as defined by [URI-TEMPLATE], on each line. Starting with the first such URI template, processors MUST:

  1. Expand the URI template, with the variable url being set to the URL of the requested tabular data file (with any fragment component of that URL removed).
  2. Resolve the resulting URL against the URL of the requested tabular data file.
  3. Attempt to retrieve a metadata document at that URL.
  4. If no metadata document is found at that location, or if the metadata file found at the location does not explicitly include a reference to the relevant tabular data file, perform these same steps on the next URI template, otherwise use that metadata document.

For example, if the tabular data file is at http://example.org/south-west/devon.csv then processors must attempt to locate a well-known file at http://example.org/.well-known/csvm. If that file contains:

Example 5
{+url}.json
csvm.json
/csvm?file={url}

the processor will first look for http://example.org/south-west/devon.csv.json. If there is no metadata file in that location, it will then look for http://example.org/south-west/csvm.json. Finally, if that also fails, it will look for http://example.org/csvm?file=http%3A%2F%2Fexample.org%2Fsouth-west%2Fdevon.csv (the simple expansion {url}, unlike {+url}, percent-encodes reserved characters).

If no file were found at http://example.org/.well-known/csvm, the processor will use the default locations and try to retrieve metadata from http://example.org/south-west/devon.csv-metadata.json and, if unsuccessful, http://example.org/south-west/csv-metadata.json.
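
The expansion and resolution steps above can be sketched informally as follows. This is not a complete [URI-TEMPLATE] implementation: it understands only the {url} and {+url} expressions, and the names expand and candidate_locations are hypothetical. Retrieval of the candidate documents and the check that they reference the tabular data file are left out.

```python
import re
from urllib.parse import quote, urldefrag, urljoin

# Content assumed when /.well-known/csvm cannot be retrieved.
DEFAULT_TEMPLATES = ["{+url}-metadata.json", "csv-metadata.json"]
RESERVED = ":/?#[]@!$&'()*+,;="


def expand(template, url):
    """Expand the {url} and {+url} expressions of a URI template."""
    def substitute(match):
        if match.group(1) == "+":
            # Reserved expansion: reserved characters are kept as they are.
            return quote(url, safe=RESERVED + "%")
        # Simple expansion: everything except unreserved characters is encoded.
        return quote(url, safe="")
    return re.sub(r"\{(\+?)url\}", substitute, template)


def candidate_locations(data_url, templates=None):
    """Return candidate metadata URLs for a tabular data file, in template order."""
    if templates is None:
        templates = DEFAULT_TEMPLATES
    base = urldefrag(data_url).url  # drop any fragment component
    return [urljoin(base, expand(template, base)) for template in templates]
```

With no site-wide file, candidate_locations("http://example.org/south-west/devon.csv") yields the two default locations given above, in that order.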

5.4 Embedded Metadata

Most syntaxes for tabular data provide a facility for embedding metadata within the tabular data file itself. The definition of a syntax for tabular data SHOULD include a description of how the syntax maps to an annotated data model, and in particular how any embedded metadata is mapped into the vocabulary defined in [tabular-metadata]. Parsing based on the default dialect for CSV, as described in 8. Parsing Tabular Data, will extract column titles from the first row of a CSV file.

Example 6: http://example.org/tree-ops.csv
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010

The results of this can be found in section 8.2.1 Simple Example.

For another example, the following tab-delimited file contains embedded metadata where it is assumed that comments may be added using a #, and that the column types may be indicated using a #datatype annotation:

Example 7: Tab-separated file containing embedded metadata
# publisher City of Palo Alto
# updated 12/31/2010
#name GID on_street species trim_cycle  inventory_date
#datatype string  string  string  string  date:M/D/YYYY
  GID On Street Species Trim Cycle  Inventory Date
  1 ADDISON AV  Celtis australis  Large Tree Routine Prune  10/18/2010
  2 EMERSON ST  Liquidambar styraciflua Large Tree Routine Prune  6/2/2010

A processor that recognises this format may be able to extract and make sense of this embedded metadata.

6. Processing Tables

This section describes how particular types of applications should process tabular data and metadata files.

In many cases, an application will start processing from a metadata file. In that case, the initial metadata file is treated as overriding metadata and the application MUST NOT continue to retrieve other available metadata about each of the tabular data files referenced by that initial metadata file other than embedded metadata.

In other cases, applications will start from a tabular data file, such as a CSV file, and locate metadata from that file. This metadata will be used to process the file as if the processor were starting from that metadata file.

For example, if a validator is passed a locally authored metadata file spending.json, which contains:

Example 8: Metadata file referencing multiple tabular data files sharing a schema
{
  "tableSchema": "government-spending.csv",
  "tables": [{
    "url": "http://example.org/east-sussex-2015-03.csv"
  }, {
    "url": "http://example.org/east-sussex-2015-02.csv"
  }, ...
  ]
}

the validator would validate all the listed tables, using the locally defined schema at government-spending.csv. It would also use the metadata embedded in the referenced CSV files; for example, when processing http://example.org/east-sussex-2015-03.csv, it would use embedded metadata within that file to verify that the CSV is compatible with the metadata.

If a validator is passed a tabular data file http://example.org/east-sussex-2015-03.csv, the validator would use the metadata located from the CSV file: the first metadata file found through the Link headers found when retrieving that file, or located through a site-wide location configuration.

Note

Starting with a metadata file can remove the need to perform additional requests to locate linked metadata, or metadata retrieved through site-wide location configuration.

6.1 Creating Annotated Tables

After locating metadata, metadata is normalized and coerced into a single table group description. When starting with a metadata file, this involves normalizing the provided metadata file and verifying that the embedded metadata for each tabular data file referenced from the metadata is compatible with the metadata. When starting with a tabular data file, this involves locating the first metadata file as described in section 5. Locating Metadata and normalizing into a single descriptor.

If processing starts with a tabular data file, implementations:

  1. Retrieve the tabular data file.
  2. Retrieve the first metadata file (FM) as described in section 5. Locating Metadata:
    1. metadata supplied by the user (see section 5.1 Overriding Metadata).
    2. metadata referenced from a Link Header that may be returned when retrieving the tabular data file (see section 5.2 Link Header).
    3. metadata retrieved through a site-wide location configuration (see section 5.3 Default Locations and Site-wide Location Configuration).
    4. embedded metadata as defined in section 5.4 Embedded Metadata with a single tables entry where the url property is set from that of the tabular data file.
  3. Proceed as if the process starts with FM.

If the process starts with a metadata file:

  1. Retrieve the metadata file yielding the metadata UM (which is treated as overriding metadata, see sectibet365 5.1 Overriding Metadata).
  2. Normalize UM using the process defined in Normalizatibet365 in [tabular-metadata], coercing UM into a table group descriptibet365, if necessary.
  3. For each table (TM) in UM in order, create bet365e or more annotated tables:
    1. Extract the dialect descriptibet365 (DD) from UM for the table associated with the tabular data file. If there is no such dialect descriptibet365, extract the first available dialect descriptibet365 from a group of tables in which the tabular data file is described. Otherwise use the default dialect descriptibet365.
    2. If using the default dialect descriptibet365, override default values in DD based bet365 HTTP headers found when retrieving the tabular data file:
      • If the media type from the Cbet365tent-Type header is text/tab-separated-values, set delimiter to TAB in DD.
      • If the Cbet365tent-Type header includes the header parameter with a value of absent, set header to false in DD.
      • If the Cbet365tent-Type header includes the charset parameter, set encoding to this value in DD.
    3. Parse the tabular data file, using DD as a guide, to create a basic tabular data model (T) and extract embedded metadata (EM), for example from the header line.

      Note

      This specificatibet365 provides a nbet365-normative definitibet365 for parsing CSV-based files, including the extractibet365 of embedded metadata, in sectibet365 8. Parsing Tabular Data. This specificatibet365 does not define any syntax for embedded metadata beybet365d this; whatever syntax is used, it's assumed that metadata can be mapped to the vocabulary defined in [tabular-metadata].

    4. If a Content-Language HTTP header was found when retrieving the tabular data file, and the value provides a single language, set the lang inherited property to this value in TM, unless TM already has a lang inherited property.
    5. Verify that TM is compatible with EM using the procedure defined in Table Description Compatibility in [tabular-metadata]; if TM is not compatible with EM, validators MUST raise an error, while other processors MUST generate a warning and continue processing.
    6. Use the metadata TM to add annotations to the tabular data model T as described in Section 2 Annotating Tables in [tabular-metadata].
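
The header-based overrides in step 3.2 can be sketched as follows. This is a minimal, non-normative illustration: the function name apply_content_type_overrides and the use of a plain dict for the dialect description are assumptions made for the example, not part of this specification.

```python
from email.message import Message


def apply_content_type_overrides(dialect: dict, content_type: str) -> dict:
    """Return a copy of a default dialect description adjusted for a Content-Type value."""
    # The standard library's email Message parses media types and parameters.
    msg = Message()
    msg["Content-Type"] = content_type
    overridden = dict(dialect)
    if msg.get_content_type() == "text/tab-separated-values":
        overridden["delimiter"] = "\t"
    if msg.get_param("header") == "absent":
        overridden["header"] = False
    charset = msg.get_param("charset")
    if charset is not None:
        overridden["encoding"] = charset
    return overridden
```

For instance, a header value of text/csv;charset=ISO-8859-1 would only change the encoding entry. The overrides apply only when the default dialect description is in use, so a caller would skip this function when the metadata supplies its own dialect.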

6.2 Metadata Compatibility

When processing a tabular data file using metadata as discovered using section 5. Locating Metadata, processors MUST ensure that the metadata and tabular data file are compatible. This is typically done by extracting embedded metadata from the tabular data file and determining that the provided or discovered metadata is compatible with the embedded metadata using the procedure defined in Table Compatibility in [tabular-metadata].

6.3 URL Normalization

Metadata Discovery and Compatibility involve comparing URLs. When comparing URLs, processors MUST use Syntax-Based Normalization as defined in [RFC3986]. Processors MUST perform Scheme-Based Normalization for HTTP (80) and HTTPS (443) and SHOULD perform Scheme-Based Normalization for other well-known schemes.
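
The following non-normative sketch shows one way such a comparison key might be computed. The function names are hypothetical, and the code makes simplifying assumptions: it handles only absolute URLs, drops any userinfo component, and does not treat IPv6 literals specially.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")
PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def _normalize_pct(match):
    # Decode escapes of unreserved characters; uppercase the hex digits of the rest.
    char = chr(int(match.group(0)[1:], 16))
    return char if char in UNRESERVED else match.group(0).upper()


def remove_dot_segments(path):
    """Resolve "." and ".." segments in an absolute path."""
    output = []
    for segment in path.split("/"):
        if segment == "..":
            if len(output) > 1:
                output.pop()
        elif segment != ".":
            output.append(segment)
    if path.endswith(("/.", "/..")):
        output.append("")
    return "/".join(output)


def normalize_url(url):
    """Syntax-based normalization plus default-port removal for HTTP and HTTPS."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # urlsplit already lowercases the host name
    port = parts.port
    netloc = host if port is None or port == DEFAULT_PORTS.get(scheme) else f"{host}:{port}"
    path = remove_dot_segments(PCT_ENCODED.sub(_normalize_pct, parts.path))
    if netloc and not path:
        path = "/"  # an empty path is equivalent to "/" when an authority is present
    query = PCT_ENCODED.sub(_normalize_pct, parts.query)
    return urlunsplit((scheme, netloc, path, query, parts.fragment))
```

Two URLs would then be considered equal when normalize_url returns the same string for both.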

6.4 Parsing Cells

Unlike many other data formats, tabular data is designed to be read by humans. For that reason, it's common for data to be represented within tabular data in a human-readable way. The datatype, default, lang, null, required, and separator annotations provide the information needed to parse the string value of a cell into its (semantic) value annotation. This is used:

The process of parsing a cell creates a cell with annotations based on the original string value, parsed value and other column annotations and adds the cell to the list of cells in a row and cells in a column:

After parsing, the cell value can be:

The process of parsing the string value into a single value or a list of values is as follows:

  1. unless the datatype base is string, json, xml, html or anyAtomicType, replace all carriage return (#xD), line feed (#xA), and tab (#x9) characters with space characters.
  2. unless the datatype base is string, json, xml, html, anyAtomicType, or normalizedString, strip leading and trailing whitespace from the string value and replace all instances of two or more whitespace characters with a single space character.
  3. if the normalized string is an empty string, apply the remaining steps to the string given by the column default annotation.
  4. if the column separator annotation is not null and the normalized string is an empty string, the cell value is an empty list. If the column required annotation is true, add an error to the list of errors for the cell.
  5. if the column separator annotation is not null, the cell value is a list of values; set the list annotation on the cell to true, and create the cell value created by:
    1. if the normalized string is the same as any one of the values of the column null annotation, then the resulting value is null.
    2. split the normalized string at the character specified by the column separator annotation.
    3. unless the datatype base is string or anyAtomicType, strip leading and trailing whitespace from these strings.
    4. apply the remaining steps to each of the strings in turn.
  6. if the string is an empty string, apply the remaining steps to the string given by the column default annotation.
  7. if the string is the same as any one of the values of the column null annotation, then the resulting value is null. If the column separator annotation is null and the column required annotation is true, add an error to the list of errors for the cell.
  8. parse the string using the datatype format if one is specified, as described below to give a value with an associated datatype. If the datatype base is string, or there is no datatype, the value has an associated language from the column lang annotation. If there are any errors, add them to the list of errors for the cell; in this case the value has a datatype of string; if the datatype base is string, or there is no datatype, the value has an associated language from the column lang annotation.
  9. validate the value based on the length constraints described in section 4.6.1 Length Constraints, the value constraints described in section 4.6.2 Value Constraints and the datatype format annotation if one is specified, as described below. If there are any errors, add them to the list of errors for the cell.

The final value (or values) become the value annotation on the cell.
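
As a non-normative illustration, steps 1 to 8 might be sketched as below. All function and parameter names are hypothetical. The sketch only parses the integer datatype base, omits the constraint validation of step 9, and assumes that the empty-list check of step 4 is made on the normalized string before any default is applied.

```python
import re

# Datatype bases whose string values are left untouched by steps 1 and 2.
KEEP_WHITESPACE = {"string", "json", "xml", "html", "anyAtomicType"}


def normalize(string_value, base):
    """Steps 1 and 2: whitespace normalization, depending on the datatype base."""
    if base not in KEEP_WHITESPACE:
        string_value = re.sub(r"[\r\n\t]", " ", string_value)
    if base not in KEEP_WHITESPACE | {"normalizedString"}:
        string_value = re.sub(r"\s{2,}", " ", string_value.strip())
    return string_value


def parse_single(string, base, default, null, required, errors):
    """Steps 6 to 8 for one string (only the integer base is actually parsed)."""
    if string == "":
        string = default
    if string in null:
        if required:
            errors.append("required value is null")
        return None
    if base == "integer":
        if re.fullmatch(r"[+-]?[0-9]+", string):
            return int(string)
        errors.append(f"{string!r} is not a valid integer")
    return string  # on error the value stays a string


def parse_cell(string_value, base="string", default="", null=("",),
               separator=None, required=False):
    """Return a (value, errors) pair for one cell."""
    errors = []
    normalized = normalize(string_value, base)
    if separator is None:
        return parse_single(normalized, base, default, null, required, errors), errors
    if normalized == "":
        if required:
            errors.append("required list is empty")
        return [], errors
    if normalized in null:
        return None, errors
    strings = normalized.split(separator)
    if base not in {"string", "anyAtomicType"}:
        strings = [s.strip() for s in strings]
    # The required check of step 7 only applies when there is no separator.
    return [parse_single(s, base, default, null, False, errors) for s in strings], errors
```

Returning the errors alongside the value mirrors the way a cell carries both a value annotation and an errors annotation.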

If there is an about URL annotation on the column, it becomes the about URL annotation on the cell, after being transformed into an absolute URL as described in URI Template Properties of [tabular-metadata].

If there is a property URL annotation on the column, it becomes the property URL annotation on the cell, after being transformed into an absolute URL as described in URI Template Properties of [tabular-metadata].

If there is a value URL annotation on the column, it becomes the value URL annotation on the cell, after being transformed into an absolute URL as described in URI Template Properties of [tabular-metadata]. The value URL annotation is null if the cell value is null and the column virtual annotation is false.

6.4.1 Parsing examples

This section is non-normative.

When no datatype annotation is available, the value of a cell is the same as its string value. For example, a cell with a string value of "99" would similarly have the (semantic) value "99".

If a datatype base is provided for the cell, that is used to create a (semantic) value for the cell. For example, if the metadata contains:

Example 9
"datatype": "integer"

for the cell with the string value "99" then the value of that cell will be the integer 99. A cell whose string value was not a valid integer (such as "one" or "1.0") would be assigned that string value as its (semantic) value annotation, but also have a validation error listed in its errors annotation.

Sometimes data uses special codes to indicate unknown or null values. For example, a particular column might contain a number that is expected to be between 1 and 10, with the string 99 used in the original tabular data file to indicate a null value. The metadata for such a column would include:

Example 10
"datatype": {
  "base": "integer",
  "minimum": 1,
  "maximum": 10
},
"null": "99"

In this case, a cell with a string value of "5" would have the (semantic) value of the integer 5; a cell with a string value of "99" would have the value null.

Similarly, a cell may be assigned a default value if the string value for the cell is empty. Consider a configuration such as:

Example 11
"datatype": {
  "base": "integer",
  "minimum": 1,
  "maximum": 10
},
"default": "5"

In this case, a cell whose string value is "" would be assigned the value of the integer 5. A cell whose string value contains whitespace, such as a single tab character, would also be assigned the value of the integer 5: when the datatype is something other than string or anyAtomicType, leading and trailing whitespace is stripped from string values before the remainder of the processing is carried out.

Cells can contain sequences of values. For example, a cell might have the string value "1 5 7.0". In this case, the separator is a space character. The appropriate configuration would be:

Example 12
"datatype": {
  "base": "integer",
  "minimum": 1,
  "maximum": 10
},
"default": "5",
"separator": " "

and this would mean that the cell's value would be an array containing two integers and a string: [1, 5, "7.0"]. The final value of the array is a string because it is not a valid integer; the cell's errors annotation will also contain a validation error.

Also, with this configuration, if the string value of the cell were "" (i.e. it was an empty cell) the value of the cell would be an empty list.

A cell value can be inserted into a URL created using a URI template property such as valueUrl. For example, if a cell with the string value "1 5 7.0" were in a column named values, defined with:

Example 13
"datatype": "decimal",
"separator": " ",
"valueUrl": "{?values}"

then after expansion of the URI template, the resulting valueUrl would be ?values=1.0,5.0,7.0. The canonical representations of the decimal values are used within the URL.

6.4.2 Formats for numeric types

By default, numeric values must be in the formats defined in [xmlschema11-2]. It is not uncommon for numbers within tabular data to be formatted for human consumption, which may involve using commas for decimal points, grouping digits in the number using commas, or adding percent signs to the number.

If the datatype base is a numeric type, the datatype format annotation indicates the expected format for that number. Its value MUST be either a single string or an object with one or more of the properties:

decimalChar
A string whose value is used to represent a decimal point within the number. The default value is ".". If the supplied value is not a string, implementations MUST issue a warning and proceed as if the property had not been specified.
groupChar
A string whose value is used to group digits within the number. The default value is null. If the supplied value is not a string, implementations MUST issue a warning and proceed as if the property had not been specified.
pattern
A number format pattern as defined in [UAX35]. Implementations MUST recognise number format patterns containing the symbols 0, #, the specified decimalChar (or "." if unspecified), the specified groupChar (or "," if unspecified), E, +, % and ‰. Implementations MAY additionally recognise number format patterns containing other special pattern characters defined in [UAX35]. If the supplied value is not a string, or if it contains an invalid number format pattern or uses special pattern characters that the implementation does not recognise, implementations MUST issue a warning and proceed as if the property had not been specified.

If the datatype format annotation is a single string, this is interpreted in the same way as if it were an object with a pattern property whose value is that string.

If the groupChar is specified, but no pattern is supplied, when parsing the string value of a cell against this format specification, implementations MUST recognise and parse numbers that consist of:

  1. an optional + or - sign,
  2. followed by a decimal digit (0-9),
  3. followed by any number of decimal digits (0-9) and the string specified as the groupChar,
  4. followed by an optional decimalChar followed by one or more decimal digits (0-9),
  5. followed by an optional exponent, consisting of an E followed by an optional + or - sign followed by one or more decimal digits (0-9), or
  6. followed by an optional percent (%) or per-mille (‰) sign.

or that are one of the special values:

  1. NaN,
  2. INF, or
  3. -INF.

Implementations MAY also recognise numeric values that are in any of the standard-decimal, standard-percent or standard-scientific formats listed in the Unicode Common Locale Data Repository.

Implementations MUST add a validation error to the errors annotation for the cell, and set the cell value to a string rather than a number if the string being parsed:

  • is not in the format specified in the pattern, if one is defined
  • otherwise, if the string
    • does not meet the numeric format defined above,
    • contains two consecutive groupChar strings,
  • contains the decimalChar, if the datatype base is integer or one of its sub-types,
  • contains an exponent, if the datatype base is decimal or one of its sub-types, or
  • is one of the special values NaN, INF, or -INF, if the datatype base is decimal or one of its sub-types.

Implementations MUST use the sign, exponent, percent, and per-mille signs when parsing the string value of a cell to provide the value of the cell. For example, the string value "-25%" must be interpreted as -0.25 and the string value "1E6" as 1000000.
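
A non-normative sketch of parsing a number when no pattern is supplied is given below. The function name parse_number and its parameters are hypothetical; the sketch returns Decimal values, does not apply the per-datatype restrictions listed above (such as rejecting a decimalChar for integers), and does not handle number format patterns.

```python
import re
from decimal import Decimal

PER_MILLE = "\u2030"
SPECIAL = {"NaN": Decimal("NaN"), "INF": Decimal("Infinity"), "-INF": Decimal("-Infinity")}


def parse_number(text, group_char=None, decimal_char="."):
    """Parse a formatted number, raising ValueError if it does not match the format."""
    if text in SPECIAL:
        return SPECIAL[text]
    group = re.escape(group_char) if group_char else ""
    # A leading digit followed by digits and, if configured, grouping strings.
    digits = rf"[0-9](?:[0-9]|{group})*" if group else r"[0-9]+"
    number = re.compile(
        rf"(?P<sign>[+-]?)(?P<int>{digits})"
        rf"(?:{re.escape(decimal_char)}(?P<frac>[0-9]+))?"
        rf"(?:E(?P<exp>[+-]?[0-9]+)|(?P<suffix>[%{PER_MILLE}]))?"
    )
    match = number.fullmatch(text)
    if match is None or (group_char and group_char * 2 in match["int"]):
        raise ValueError(f"{text!r} does not match the numeric format")
    integer = match["int"].replace(group_char, "") if group_char else match["int"]
    value = Decimal(f"{match['sign']}{integer}.{match['frac'] or '0'}E{match['exp'] or '0'}")
    # Percent and per-mille signs scale the parsed value.
    if match["suffix"] == "%":
        value /= 100
    elif match["suffix"] == PER_MILLE:
        value /= 1000
    return value
```

A caller that catches the ValueError would then keep the original string as the cell value and record a validation error, as required above.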

6.4.3 Formats for booleans

Boolean values may be represented in many ways aside from the standard 1 and 0 or true and false.

If the datatype base for a cell is boolean, the datatype format annotation provides the true value followed by the false value, separated by |. For example if format is Y|N then cells must hold either Y or N with Y meaning true and N meaning false. If the format does not follow this syntax, implementations MUST issue a warning and proceed as if no format had been provided.

The resulting cell value will be one or more boolean true or false values.
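
A minimal, non-normative sketch of this behaviour follows. The function name parse_boolean is hypothetical, and raising ValueError for an unrecognised string is an assumption of the sketch; a processor would record a validation error on the cell instead.

```python
import warnings


def parse_boolean(text, fmt=None):
    """Parse a boolean string value, optionally using a "true|false" format."""
    true_values, false_values = {"true", "1"}, {"false", "0"}
    if fmt is not None:
        parts = fmt.split("|")
        if len(parts) == 2 and all(parts):
            true_values, false_values = {parts[0]}, {parts[1]}
        else:
            # An invalid format is ignored after a warning.
            warnings.warn(f"invalid boolean format {fmt!r}; ignoring it")
    if text in true_values:
        return True
    if text in false_values:
        return False
    raise ValueError(f"{text!r} is not a valid boolean")
```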

6.4.4 Formats for dates and times

By default, dates and times are assumed to be in the format defined in [xmlschema11-2]. However dates and times are commonly represented in tabular data in other formats.

If the datatype base is a date or time type, the datatype format annotation indicates the expected format for that date or time.

The supported date and time format patterns listed here are expressed in terms of the date field symbols defined in [UAX35]. These formats MUST be recognised by implementations and MUST be interpreted as defined in that specification. Implementations MAY additionally recognise other date format patterns. Implementations MUST issue a warning if the date format pattern is invalid or not recognised and proceed as if no date format pattern had been provided.

Note

For interoperability, authors of metadata documents SHOULD use only the formats listed in this section.

The following date format patterns MUST be recognized by implementations:

  • yyyy-MM-dd e.g., 2015-03-22
  • yyyyMMdd e.g., 20150322
  • dd-MM-yyyy e.g., 22-03-2015
  • d-M-yyyy e.g., 22-3-2015
  • MM-dd-yyyy e.g., 03-22-2015
  • M-d-yyyy e.g., 3-22-2015
  • dd/MM/yyyy e.g., 22/03/2015
  • d/M/yyyy e.g., 22/3/2015
  • MM/dd/yyyy e.g., 03/22/2015
  • M/d/yyyy e.g., 3/22/2015
  • dd.MM.yyyy e.g., 22.03.2015
  • d.M.yyyy e.g., 22.3.2015
  • MM.dd.yyyy e.g., 03.22.2015
  • M.d.yyyy e.g., 3.22.2015

The following time format patterns MUST be recognized by implementations:

  • HH:mm:ss.S with one or more trailing S characters indicating the maximum number of fractional seconds e.g., HH:mm:ss.SSS for 15:02:37.143
  • HH:mm:ss e.g., 15:02:37
  • HHmmss e.g., 150237
  • HH:mm e.g., 15:02
  • HHmm e.g., 1502

The following date/time format patterns MUST be recognized by implementations:

  • yyyy-MM-ddTHH:mm:ss.S with one or more trailing S characters indicating the maximum number of fractional seconds e.g., yyyy-MM-ddTHH:mm:ss.SSS for 2015-03-15T15:02:37.143
  • yyyy-MM-ddTHH:mm:ss e.g., 2015-03-15T15:02:37
  • yyyy-MM-ddTHH:mm e.g., 2015-03-15T15:02
  • any of the date formats above, followed by a single space, followed by any of the time formats above, e.g., M/d/yyyy HH:mm for 3/22/2015 15:02 or dd.MM.yyyy HH:mm:ss for 22.03.2015 15:02:37

Implementations MUST also recognise date, time, and date/time format patterns that end with timezone markers consisting of between one and three x or X characters, possibly after a single space. These MUST be interpreted as follows:

  • X e.g., -08, +0530, or Z (minutes are optional)
  • XX e.g., -0800, +0530, or Z
  • XXX e.g., -08:00, +05:30, or Z
  • x e.g., -08 or +0530 (Z is not permitted)
  • xx e.g., -0800 or +0530 (Z is not permitted)
  • xxx e.g., -08:00 or +05:30 (Z is not permitted)

For example, date format patterns could include yyyy-MM-ddTHH:mm:ssXXX for 2015-03-15T15:02:37Z or 2015-03-15T15:02:37-05:00, or HH:mm x for 15:02 -05.

The cell value will be one or more date/time values extracted using the format.
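
As a non-normative illustration, the date field symbols used by the patterns above can be translated into the directives understood by Python's datetime.strptime. The function names are hypothetical, and the sketch deliberately omits timezone markers. Note that strptime is more lenient than the patterns: for example, it accepts both 3 and 03 for MM, and up to six digits for any run of S characters.

```python
import re
from datetime import datetime

# Date field symbols used by the patterns listed above, mapped to strptime directives.
_DIRECTIVES = {"yyyy": "%Y", "MM": "%m", "M": "%m", "dd": "%d", "d": "%d",
               "HH": "%H", "mm": "%M", "ss": "%S"}
_SYMBOL = re.compile(r"yyyy|MM|dd|HH|mm|ss|S+|M|d")


def to_strptime(pattern):
    """Translate a date/time format pattern into a strptime format string."""
    def replace(match):
        symbol = match.group(0)
        # Any run of S characters stands for fractional seconds.
        return "%f" if symbol.startswith("S") else _DIRECTIVES[symbol]
    return _SYMBOL.sub(replace, pattern)


def parse_datetime(text, pattern):
    """Parse text using a format pattern; time-only patterns yield the date 1900-01-01."""
    return datetime.strptime(text, to_strptime(pattern))
```

For example, to_strptime("dd.MM.yyyy HH:mm:ss") gives "%d.%m.%Y %H:%M:%S", which can then be used to parse a string such as 22.03.2015 15:02:37.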

Note

For simplicity, this version of this standard does not support abbreviated or full month or day names, or double digit years. Future versions of this standard may support other date and time formats, or general purpose date/time pattern strings. Authors of schemas SHOULD use appropriate regular expressions, along with the string datatype, for dates and times that use a format other than that specified here.

6.4.5 Formats for durations

Durations MUST be formatted and interpreted as defined in [xmlschema11-2], using the [ISO8601] format -?PnYnMnDTnHnMnS. For example, the duration P1Y1D is used for a year and a day; the duration PT2H30M for 2 hours and 30 minutes.

If the datatype base is a duration type, the datatype format annotation provides a regular expression for the string values, with syntax and processing defined by [ECMASCRIPT]. If the supplied value is not a valid regular expression, implementations MUST issue a warning and proceed as if no format had been provided.

Note

Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.

The cell value will be one or more durations extracted using the format.
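
The -?PnYnMnDTnHnMnS form can be recognised with a regular expression, as in the non-normative sketch below. The function name parse_duration and the dictionary it returns are assumptions of the example, and the sketch does not apply any regular expression supplied through the datatype format annotation.

```python
import re

_DURATION = re.compile(
    r"(?P<sign>-)?P(?=[0-9T])"
    r"(?:(?P<years>[0-9]+)Y)?(?:(?P<months>[0-9]+)M)?(?:(?P<days>[0-9]+)D)?"
    r"(?:T(?=[0-9])(?:(?P<hours>[0-9]+)H)?(?:(?P<minutes>[0-9]+)M)?"
    r"(?:(?P<seconds>[0-9]+(?:\.[0-9]+)?)S)?)?"
)


def parse_duration(text):
    """Split a duration string into its components, raising ValueError if invalid."""
    match = _DURATION.fullmatch(text)
    if match is None:
        raise ValueError(f"{text!r} is not a valid duration")
    fields = match.groupdict(default="0")
    negative = fields.pop("sign") == "-"
    # Seconds may carry a fractional part; all other components are integers.
    result = {name: float(v) if name == "seconds" else int(v) for name, v in fields.items()}
    result["negative"] = negative
    return result
```

The lookaheads after P and T reject strings such as P or PT that name no component at all.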

6.4.6 Formats for other types

If the datatype base is not numeric, boolean, a date/time type, or a duration type, the datatype format annotation provides a regular expression for the string values, with syntax and processing defined by [ECMASCRIPT]. If the supplied value is not a valid regular expression, implementations MUST issue a warning and proceed as if no format had been provided.

Note

Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.

Values that are labelled as html, xml, or json SHOULD NOT be validated against those formats.

Note

Metadata creators who wish to check the syntax of HTML, XML, or JSON within tabular data should use the datatype format annotation to specify a regular expression against which such values will be tested.

6.5 Presenting Tables

This section is non-normative.

When presenting tables, implementations should:

6.5.1 Bidirectional Tables

There are two levels of bidirectionality to consider when displaying tables: the directionality of the table (i.e., whether the columns should be arranged left-to-right or right-to-left) and the directionality of the content of individual cells.

The table direction annotation on the table provides information about the desired display of the columns in the table. If table direction is ltr then the first column should be displayed on the left and the last column on the right. If table direction is rtl then the first column should be displayed on the right and the last column on the left.

If table direction is auto then tables should be displayed with attention to the bidirectionality of the content of the cells in the table. Specifically, the values of the cells in the table should be scanned breadth first: from the first cell in the first column through to the last cell in the first row, down to the last cell in the last column. If the first character in the table with a strong type as defined in [BIDI] indicates a RTL directionality, the table should be displayed with the first column on the right and the last column on the left. Otherwise, the table should be displayed with the first column on the left and the last column on the right. Characters such as whitespace, quotes, commas, and numbers do not have a strong type, and therefore are skipped when identifying the character that determines the directionality of the table.

Implementations should enable user preferences to override the indicated metadata about the directionality of the table.

Once the directionality of the table has been determined, each cell within the table should be considered as a separate paragraph, as defined by the Unicode Bidirectional Algorithm (UBA) in [BIDI]. The directionality for the cell is determined by looking at the text direction annotation for the cell, as follows:

  1. If the text direction is ltr then the base direction for the cell content should be set to left-to-right.
  2. If the text direction is rtl then the base direction for the cell content should be set to right-to-left.
  3. If the text direction is auto then the base direction for the cell content should be set to the direction determined by the first character in the cell with a strong type as defined in [BIDI].

Note

If the textDirection property in metadata has the value "inherit", the text direction annotation for a cell inherits its value from the table direction annotation on the table.

When the titles of a column are displayed, these should be displayed in the direction determined by the first character in the title with a strong type as defined in [BIDI]. Titles for the same column in different languages may be displayed in different directions.
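
Finding the first character with a strong type can be sketched with the bidirectional classes exposed by Python's unicodedata module. The function names below are hypothetical, and the caller is assumed to supply the cell values already in the scan order described above.

```python
import unicodedata


def first_strong_direction(text):
    """Return "ltr" or "rtl" for the first strongly typed character, or None."""
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None  # only weak or neutral characters, such as digits and punctuation


def table_direction(cell_values, declared="auto"):
    """Resolve the display direction of a table from its table direction annotation."""
    if declared in ("ltr", "rtl"):
        return declared
    for value in cell_values:
        direction = first_strong_direction(value)
        if direction is not None:
            return direction
    return "ltr"  # no strong character found anywhere
```

The same first_strong_direction helper could serve for a cell whose text direction is auto, or for a column title.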

6.5.2 Column and row labelling

The labelling of columns and rows helps those who are attempting to understand the content of a table to grasp what a particular cell means. Implementations should present appropriate titles for columns, and ensure that the most important information in a row is kept apparent to the user, to aid their understanding. For example:

  • a table presented on the screen might retain certain columns in view so that readers can easily glance at the identifying information in each row
  • as the user moves focus into a cell, screen readers announce a label for the new column if the user has changed column, or for the new row if the user has changed row

When labelling a column, either on the screen or aurally, implementations should use the first available of:

  1. the column's titles in the preferred language of the user, or with an undefined language if there is no title available in a preferred language; there may be multiple such titles in which case all should be announced
  2. the column's name
  3. the column's number

When labelling a row, either on the screen or aurally, implementations should use the first available of:

  1. the row's titles in the preferred language of the user, or with an undefined language if there is no title available in a preferred language; there may be multiple such titles in which case all should be announced
  2. the values of the cells in the row's primary key
  3. the row's number

6.6 Validating Tables

Validators test whether given tabular data files adhere to the structure defined within a schema. Validators MUST raise errors (and halt processing) and issue warnings (and continue processing) as defined in [tabular-metadata]. In addition, validators MUST raise errors but MAY continue validating in the following situations:

6.7 Converting Tables

Conversions of tabular data to other formats operate over an annotated table constructed as defined in Annotating Tables in [tabular-metadata]. The mechanics of these conversions to other formats are defined in other specifications such as [csv2json] and [csv2rdf].

Conversion specifications MUST define a default mapping from an annotated table that lacks any annotations (i.e., that is equivalent to an un-annotated table).

Conversion specifications MUST use the property value of the propertyUrl of a column as the basis for naming machine-readable fields in the target format, such as the name of the equivalent element or attribute in XML, property in JSON or property URI in RDF.

Conversion specifications MAY use any of the annotations found on an annotated table group, table, column, row or cell, including non-core annotations, to adjust the mapping into another format.

Conversion specifications MAY define additional annotations, not defined in this specification, which are specifically used when converting to the target format of the conversion. For example, a conversion to XML might specify a http://example.org/conversion/xml/element-or-attribute property on columns that determines whether a particular column is represented through an element or an attribute in the data.

7. Best Practice CSV

This section is non-normative.

There is no standard for CSV, and there are many variants of CSV used on the web today. This section defines a method for expressing tabular data adhering to the annotated tabular data model in CSV. Authors are encouraged to adhere to the constraints described in this section as implementations should process such CSV files consistently.

Note

This syntax is not compliant with text/csv as defined in [RFC4180] in that it permits line endings other than CRLF. Supporting LF line endings is important for data formats that are used on non-Windows platforms. However, all files that adhere to [RFC4180]'s definition of CSV meet the constraints described in this section.

Developing a standard for CSV is outside the scope of the Working Group. The details here aim to help shape any future standard.

7.1 Content Type

The appropriate content type for a CSV file is text/csv. For example, when a CSV file is transmitted via HTTP, the HTTP response should include a Content-Type header with the value text/csv:

Content-Type: text/csv

7.2 Encoding

CSV files should be encoded using UTF-8, and should be in Unicode Normal Form C as defined in [UAX15]. If a CSV file is not encoded using UTF-8, the encoding should be specified through the charset parameter in the Content-Type header:

Content-Type: text/csv;charset=ISO-8859-1


7.3 Line Endings

The ends of rows in a CSV file should be CRLF (U+000D U+000A) but may be LF (U+000A). Line endings within escaped cells are not normalised.

7.4 Lines

Each line of a CSV file should contain the same number of comma-separated values.

Values that contain commas, line endings, or double quotes should be escaped by having the entire value wrapped in double quotes. There should not be whitespace before or after the double quotes. Within these escaped cells, any double quotes should be escaped with two double quotes ("").
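
The escaping rule can be sketched as follows. The function names escape_value and format_row are hypothetical, and the sketch assumes the recommended CRLF line ending.

```python
def escape_value(value):
    """Wrap a value in double quotes if it contains a comma, quote or line ending."""
    if any(char in value for char in ',"\r\n'):
        # Double quotes inside an escaped cell are themselves doubled.
        return '"' + value.replace('"', '""') + '"'
    return value


def format_row(values):
    """Join escaped values with commas and terminate the row with CRLF."""
    return ",".join(escape_value(value) for value in values) + "\r\n"
```

For example, the value b,c is written as "b,c", while a value without commas, quotes or line endings is written unchanged.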

7.4.1 Headers

The first line of a CSV file should contain a comma-separated list of names of columns. This is known as the header line and provides titles for the columns. There are no constraints on these titles.

If a CSV file does not include a header line, this should be specified using the header parameter of the media type:

Content-Type: text/csv;header=absent


7.5 Grammar

This grammar is a generalization of that defined in [RFC4180] and is included for reference only.

The EBNF used here is defined in XML 1.0 [EBNF-NOTATION].

[1] csv ::= header record+
[2] header ::= record
[3] record ::= fields #x0D? #x0A
[4] fields ::= field ("," fields)*
[5] field ::= WS* rawfield WS*
[6] rawfield ::= '"' QCHAR* '"' |SCHAR*
[7] QCHAR ::= [^"] |'""'
[8] SCHAR ::= [^",#x0A#x0D]
[9] WS ::= [#x20#x09]

8. Parsing Tabular Data

This section is non-normative.

As described in section 7. Best Practice CSV, there may be many formats which an application might interpret into the tabular data model described in section 4. Tabular Data Models, including using different separators or fixed format tables, multiple tables within a single file, or ones that have metadata lines before a table header.

Note

Standardizing the parsing of CSV is outside the chartered scope of the Working Group. This non-normative section is intended to help the creators of parsers handle the wide variety of CSV-based formats that they may encounter due to the current lack of standardization of the format.

This section describes an algorithm for parsing formats that do not adhere to the constraints described in section 7. Best Practice CSV, as well as those that do, and extracting embedded metadata. The parsing algorithm uses the following flags. These may be set by metadata properties found while Locating Metadata, including through user input (see Overriding Metadata), or through the inclusion of a dialect description within a metadata file:

comment prefix
A string that, when it appears at the beginning of a row, indicates that the row is a comment that should be associated as a rdfs:comment annotation to the table. This is set by the commentPrefix property of a dialect description. The default is null, which means no rows are treated as comments. A value other than null may mean that the source numbers of rows are different from their numbers.
delimiter
The separator between cells, set by the delimiter property of a dialect description. The default is ,.
encoding
The character encoding for the file, one of the encodings listed in [encoding], set by the encoding property of a dialect description. The default is utf-8.
escape character
The string that is used to escape the quote character within escaped cells, or null, set by the doubleQuote property of a dialect description. The default is " (such that "" is used to escape " within an escaped cell).
header row count
The number of header rows (following the skipped rows) in the file, set by the header or headerRowCount property of a dialect description. The default is 1. A value other than 0 will mean that the source numbers of rows will be different from their numbers.
line terminators
The strings that can be used at the end of a row, set by the lineTerminators property of a dialect description. The default is [CRLF, LF].
quote character
The string that is used around escaped cells, or null, set by the quoteChar property of a dialect description. The default is ".
skip blank rows
Indicates whether to ignore wholly empty rows (i.e. rows in which all the cells are empty), set by the skipBlankRows property of a dialect description. The default is false. A value other than false may mean that the source numbers of rows are different from their numbers.
skip columns
The number of columns to skip at the beginning of each row, set by the skipColumns property of a dialect description. The default is 0. A value other than 0 will mean that the source numbers of columns will be different from their numbers.
skip rows
The number of rows to skip at the beginning of the file, before a header row or tabular data, set by the skipRows property of a dialect description. The default is 0. A value greater than 0 will mean that the source numbers of rows will be different from their numbers.
trim
Indicates whether to trim whitespace around cells; may be true, false, start, or end, set by the skipInitialSpace or trim property of a dialect description. The default is true.
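
A parser might hold these flags in a single structure whose defaults match the values listed above, as in this non-normative sketch. The class name ParserFlags, its attribute names and the helper flags_from_dialect are hypothetical. The sketch leaves out the doubleQuote and skipInitialSpace properties, whose mapping onto flags is defined in [tabular-metadata], and it assumes that headerRowCount takes precedence over header when both are present.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class ParserFlags:
    comment_prefix: Optional[str] = None
    delimiter: str = ","
    encoding: str = "utf-8"
    escape_character: Optional[str] = '"'
    header_row_count: int = 1
    line_terminators: List[str] = field(default_factory=lambda: ["\r\n", "\n"])
    quote_character: Optional[str] = '"'
    skip_blank_rows: bool = False
    skip_columns: int = 0
    skip_rows: int = 0
    trim: Union[bool, str] = True  # True, False, "start" or "end"


# Dialect description properties that map directly onto a flag.
_DIRECT = {
    "commentPrefix": "comment_prefix", "delimiter": "delimiter",
    "encoding": "encoding", "headerRowCount": "header_row_count",
    "lineTerminators": "line_terminators", "quoteChar": "quote_character",
    "skipBlankRows": "skip_blank_rows", "skipColumns": "skip_columns",
    "skipRows": "skip_rows", "trim": "trim",
}


def flags_from_dialect(dialect):
    """Build parser flags from a dialect description given as a dict."""
    flags = ParserFlags()
    if "header" in dialect:
        flags.header_row_count = 1 if dialect["header"] else 0
    for prop, attribute in _DIRECT.items():
        if prop in dialect:
            setattr(flags, attribute, dialect[prop])
    return flags
```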

The algorithm for using these flags to parse a document containing tabular data to create a basic annotated tabular data model and to extract embedded metadata is as follows:

  1. Create a new table T with the annotations:
  2. Create a metadata document structure M that looks like:
    {
      "@context": "http://www.w3.org/ns/csvw",
      "rdfs:comment": [],
      "tableSchema": {
        "columns": []
      }
    }

  3. If the URL of the tabular data file being parsed is known, set the url property on M to that URL.
  4. Set source row number to 1.
  5. Read the file using the encoding, as specified in [encoding], using the replacement error mode. If the encoding is not a Unicode encoding, use a normalizing transcoder to normalize into Unicode Normal Form C as defined in [UAX15].

    Note

    The replacement error mode ensures that any non-Unicode characters within the CSV file are replaced by U+FFFD, ensuring that strings within the tabular data model such as column titles and cell string values only contain valid Unicode characters.

  6. Repeat the following the number of times indicated by skip rows:
    1. Read a row to provide the row content.
    2. If the comment prefix is not null and the row content begins with the comment prefix, strip that prefix from the row content, and add the resulting string to the M.rdfs:comment array.
    3. Otherwise, if the row content is not an empty string, add the row content to the M.rdfs:comment array.
    4. Add 1 to the source row number.
  7. Repeat the following the number of times indicated by header row count:
    1. Read a row to provide the row content.
    2. If the comment prefix is not null and the row content begins with the comment prefix, strip that prefix from the row content, and add the resulting string to the M.rdfs:comment array.
    3. Otherwise, parse the row to provide a list of cell values, and:
      1. Remove the first skip columns number of values from the list of cell values.
      2. For each of the remaining values at index i in the list of cell values:
        1. If the value at index i in the list of cell values is an empty string or consists only of whitespace, do nothing.
        2. Otherwise, if there is no column description object at index i in M.tableSchema.columns, create a new one with a titles property whose value is an array containing a single value that is the value at index i in the list of cell values.
        3. Otherwise, add the value at index i in the list of cell values to the array at M.tableSchema.columns[i].titles.
    4. Add 1 to the source row number.
  8. If header row count is zero, create an empty column description object in M.tableSchema.columns for each column in the current row after skip columns.
  9. Set row number to 1.
  10. While it is possible to read another row, do the following:
    1. Set the source column number to 1.
    2. Read a row to provide the row content.
    3. If the comment prefix is not null and the row content begins with the comment prefix, strip that prefix from the row content, and add the resulting string to the M.rdfs:comment array.
    4. Otherwise, parse the row to provide a list of cell values, and:
      1. If all of the values in the list of cell values are empty strings, and skip blank rows is true, add 1 to the source row number and move on to process the next row.
      2. Otherwise, create a new row R, with:
      3. Append R to the rows of table T.
      4. Remove the first skip columns number of values from the list of cell values and add that number to the source column number.
      5. For each of the remaining values at index i in the list of cell values (where i starts at 1):
        1. Identify the column C at index i within the columns of table T. If there is no such column:
          1. Create a new column C with:
          2. Append C to the columns of table T (at index i).
        2. Create a new cell D, with:
        3. Append cell D to the cells of column C.
        4. Append cell D to the cells of row R (at index i).
        5. Add 1 to the source column number.
    5. Add 1 to the source row number.
  11. If M.rdfs:comment is an empty array, remove the rdfs:comment property from M.
  12. Return the table T and the embedded metadata M.
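
Step 5 of the algorithm above (decoding with a replacement error mode, then normalizing non-Unicode encodings to Normal Form C) could be sketched as follows. The function name and the set of encodings treated as "Unicode encodings" are assumptions made for illustration, and Python's errors="replace" handler is used here as a stand-in for the replacement error mode of [encoding], which it approximates but does not necessarily match byte for byte.

```python
import unicodedata

# Assumption: the encodings this sketch treats as "Unicode encodings".
UNICODE_ENCODINGS = {"utf-8", "utf-16", "utf-16le", "utf-16be"}


def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes, replacing undecodable sequences with U+FFFD."""
    text = raw.decode(encoding, errors="replace")
    if encoding.lower() not in UNICODE_ENCODINGS:
        # Non-Unicode source encoding: normalize the result to NFC.
        text = unicodedata.normalize("NFC", text)
    return text
```

With this sketch, an invalid byte in a UTF-8 file surfaces as U+FFFD in a title or cell string value instead of aborting the parse.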

To read a row to provide row content, perform the following steps:

  1. Set the row content to an empty string.
  2. Read initial characters and process as follows:
    1. If the string starts with the escape character followed by the quote character, append both strings to the row content, and move on to process the string following the quote character.
    2. Otherwise, if the string starts with the escape character and the escape character is not the same as the quote character, append the escape character and the single character following it to the row content and move on to process the string following that character.
    3. Otherwise, if the string starts with the quote character, append the quoted value obtained by reading a quoted value to the row content and move on to process the string following the quoted value.
    4. Otherwise, if the string starts with one of the line terminators, return the row content.
    5. Otherwise, append the first character to the row content and move on to process the string following that character.
  3. If there are no more characters to read, return the row content.

To read a quoted value to provide a quoted value, perform the following steps:

  1. Set the quoted value to an empty string.
  2. Read the initial quote character and add a quote character to the quoted value.
  3. Read initial characters and process as follows:
    1. If the string starts with the escape character followed by the quote character, append both strings to the quoted value, and move on to process the string following the quote character.
    2. Otherwise, if the string starts with the escape character and the escape character is not the same as the quote character, append the escape character and the character following it to the quoted value and move on to process the string following that character.
    3. Otherwise, if the string starts with the quote character, return the quoted value.
    4. Otherwise, append the first character to the quoted value and move on to process the string following that character.
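
The two procedures above (reading a row and reading a quoted value) might be sketched as below. This is an illustrative reading of the steps, not a reference implementation: the function names and the position-based interface are hypothetical, and the sketch appends the closing quote character to the quoted value so that the row content keeps its quotes intact, a detail the steps leave implicit.

```python
def read_quoted_value(text, pos, quote='"', escape='"'):
    """Read a quoted value starting at text[pos]; return (value, next_pos)."""
    value = quote
    pos += len(quote)
    while pos < len(text):
        if escape is not None and text.startswith(escape + quote, pos):
            value += escape + quote
            pos += len(escape) + len(quote)
        elif escape is not None and escape != quote and text.startswith(escape, pos):
            value += text[pos:pos + len(escape) + 1]
            pos += len(escape) + 1
        elif text.startswith(quote, pos):
            # Assumption: keep the closing quote in the returned value.
            return value + quote, pos + len(quote)
        else:
            value += text[pos]
            pos += 1
    return value, pos  # unterminated quoted value: return what was read


def read_row(text, pos=0, quote='"', escape='"', terminators=("\r\n", "\n")):
    """Read one row starting at text[pos]; return (row_content, next_pos)."""
    content = ""
    while pos < len(text):
        if escape is not None and quote is not None and text.startswith(escape + quote, pos):
            content += escape + quote
            pos += len(escape) + len(quote)
        elif escape is not None and escape != quote and text.startswith(escape, pos):
            content += text[pos:pos + len(escape) + 1]
            pos += len(escape) + 1
        elif quote is not None and text.startswith(quote, pos):
            quoted, pos = read_quoted_value(text, pos, quote, escape)
            content += quoted
        else:
            # Terminators are tried in order, so CRLF is matched before LF.
            ending = next((t for t in terminators if text.startswith(t, pos)), None)
            if ending is not None:
                return content, pos + len(ending)
            content += text[pos]
            pos += 1
    return content, pos


text = 'a,"b\nc",d\r\ne,f\n'
first_row, offset = read_row(text)
second_row, offset = read_row(text, offset)
```

Note how the line feed inside the quoted cell does not end the first row: the quoted value is consumed as a unit before line terminators are considered.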

To parse a row to provide a list of cell values, perform the following steps:

  1. Set the list of cell values to an empty list and the current cell value to an empty string.
  2. Set the quoted flag to false.
  3. Read initial characters and process as follows:
    1. If the string starts with the escape character followed by the quote character, append the quote character to the current cell value, and move on to process the string following the quote character.
    2. Otherwise, if the string starts with the escape character and the escape character is not the same as the quote character, append the character following the escape character to the current cell value and move on to process the string following that character.
    3. Otherwise, if the string starts with the quote character then:
      1. If quoted is false, set the quoted flag to true, and move on to process the remaining string. If the current cell value is not an empty string, raise an error.
      2. Otherwise, set quoted to false, and move on to process the remaining string. If the remaining string does not start with the delimiter, raise an error.
    4. Otherwise, if the string starts with the delimiter, then:
      1. If quoted is true, append the delimiter string to the current cell value and move on to process the remaining string.
      2. Otherwise, conditionally trim the current cell value, add the resulting trimmed cell value to the list of cell values and move on to process the following string.
    5. Otherwise, append the first character to the current cell value and move on to process the remaining string.
  4. If there are no more characters to read, conditionally trim the current cell value, add the resulting trimmed cell value to the list of cell values and return the list of cell values.

To conditionally trim a cell value to provide a trimmed cell value, perform the following steps:

  1. Set the trimmed cell value to the provided cell value.
  2. If trim is true or start then remove any leading whitespace from the start of the trimmed cell value and move on to the next step.
  3. If trim is true or end then remove any trailing whitespace from the end of the trimmed cell value and move on to the next step.
  4. Return the trimmed cell value.
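
The row-parsing and trimming procedures above could be sketched as follows. The function names are hypothetical, and the sketch makes three labelled assumptions that go beyond the literal steps: when the escape character equals the quote character, the doubled-quote rule is applied only inside a quoted cell (so that an empty quoted cell "" yields an empty string), the current cell value is reset after each delimiter, and a closing quote at the very end of the row is accepted.

```python
def conditionally_trim(value, trim=True):
    """Trim leading and/or trailing whitespace according to the trim flag."""
    if trim in (True, "start"):
        value = value.lstrip()
    if trim in (True, "end"):
        value = value.rstrip()
    return value


def parse_row(row, delimiter=",", quote='"', escape='"', trim=True):
    """Split row content into a list of cell values (illustrative sketch)."""
    cells, cell, quoted, pos = [], "", False, 0
    while pos < len(row):
        # Assumption: a doubled quote only counts as an escape inside quotes.
        escapable = quoted or escape != quote
        if escape is not None and quote is not None and escapable \
                and row.startswith(escape + quote, pos):
            cell += quote
            pos += len(escape) + len(quote)
        elif escape is not None and escape != quote and row.startswith(escape, pos):
            cell += row[pos + len(escape):pos + len(escape) + 1]
            pos += len(escape) + 1
        elif quote is not None and row.startswith(quote, pos):
            pos += len(quote)
            if not quoted:
                if cell != "":
                    raise ValueError("quote character inside an unquoted cell")
                quoted = True
            else:
                quoted = False
                # Assumption: end of row is accepted after a closing quote.
                if pos < len(row) and not row.startswith(delimiter, pos):
                    raise ValueError("closing quote not followed by delimiter")
        elif row.startswith(delimiter, pos):
            pos += len(delimiter)
            if quoted:
                cell += delimiter
            else:
                cells.append(conditionally_trim(cell, trim))
                cell = ""  # Assumption: start a fresh cell value.
        else:
            cell += row[pos]
            pos += 1
    cells.append(conditionally_trim(cell, trim))
    return cells
```

For example, parse_row('a,"b,c"') keeps the comma inside the quoted cell, while parse_row(' x ', trim="start") removes only the leading whitespace.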
Note

This parsing algorithm does not account for the possibility of there being more than one area of tabular data within a single CSV file.

8.1 Bidirectionality in CSV Files

This section is non-normative.

Bidirectional content does not alter the definition of rows or the assignment of cells to columns. Whether or not a CSV file contains right-to-left characters, the first column's content is the first cell of each row, which is the text prior to the first occurrence of a comma within that row.

For example, Egyptian Referendum results are available as a CSV file at https://egelections-2011.appspot.com/Referendum2012/results/csv/EG.csv. Over the wire and in non-Unicode-aware text editors, the CSV looks like:

            
?????????????????,????????? ???????????,????????? ??????? ???????????,??????? ?????????????????,??????????????? ???????????????,??????????????? ???????????????,????????? ?????????????????,???????????,??????? ???????????
???????????????????,60.0,40.0,"2,639,808","853,125","15,224",32.9,"512,055","341,070"
?????????????,66.7,33.3,"4,383,701","1,493,092","24,105",34.6,"995,417","497,675"
???????????????,43.2,56.8,"6,580,478","2,254,698","36,342",34.8,"974,371","1,280,327"
???????,84.5,15.5,"1,629,713","364,509","6,743",22.8,"307,839","56,670"
...
            
          

Within this CSV file, the first column appears as the content of each line before the first comma and is named ???????? (appearing at the start of each row as ????????????????? in the example, which is displaying the relevant characters from left to right in the order they appear "on the wire").

The CSV translates to a table model that looks like:

Column / Row column 1 column 2 column 3 column 4 column 5 column 6 column 7 column 8 column 9
column names???????????? ????????? ??? ???????? ??????????????? ?????????????? ??????????? ???????????????? ?????
row 1?????????60.040.02,639,808853,12515,22432.9512,055341,070
row 2??????66.733.34,383,7011,493,09224,10534.6995,417497,675
row 3???????43.256.86,580,4782,254,69836,34234.8974,3711,280,327
row 4???84.515.51,629,713364,5096,74322.8307,83956,670

The fragment identifier #col=3 identifies the third of the columns, named ???? ??? ????? (appearing as ????????? ??????? ??????????? in the example).

Section 6.5.1 Bidirectional Tables defines how this table model should be displayed by compliant applications, and how metadata can affect the display. The default is for the display to be determined by the content of the table. For example, if this CSV were turned into an HTML table for display in a web page, it should be displayed with the first column on the right and the last on the left, as follows:

??? ????? ????? ???? ???????? ??????? ??????? ??????? ??????? ??? ???????? ???? ??? ????? ???? ????? ????????
341,070 512,055 32.9 15,224 853,125 2,639,808 40.0 60.0 ?????????
497,675 995,417 34.6 24,105 1,493,092 4,383,701 33.3 66.7 ??????
1,280,327 974,371 34.8 36,342 2,254,698 6,580,478 56.8 43.2 ???????
56,670 307,839 22.8 6,743 364,509 1,629,713 15.5 84.5 ???

The fragment identifier #col=3 still identifies the third of the columns, named ???? ??? ?????, which appears in the HTML display as the third column from the right and is what those who read right-to-left would think of as the third column.

Note that this display matches that shown on the original website.

8.2 Examples

8.2.1 Simple Example

A simple CSV file that complies with the constraints described in section 7. Best Practice CSV, at http://example.org/tree-ops.csv, might look like:

Example 14: http://example.org/tree-ops.csv
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010

Parsing this file results in an annotated tabular data model of a single table T with five columns and two rows. The columns have the annotations shown in the following table:

id | core annotations
   | table | number | source number | cells | titles
C1 | T | 1 | 1 | C1.1, C2.1 | GID
C2 | T | 2 | 2 | C1.2, C2.2 | On Street
C3 | T | 3 | 3 | C1.3, C2.3 | Species
C4 | T | 4 | 4 | C1.4, C2.4 | Trim Cycle
C5 | T | 5 | 5 | C1.5, C2.5 | Inventory Date

The extracted embedded metadata, as defined in [tabular-metadata], would look like:

Example 15: tree-ops.csv Embedded Metadata
{
  "@type": "Table",
  "url": "http://example.org/tree-ops.csv",
  "tableSchema": {
    "columns": [
      {"titles": [ "GID" ]},
      {"titles": [ "On Street" ]},
      {"titles": [ "Species" ]},
      {"titles": [ "Trim Cycle" ]},
      {"titles": [ "Inventory Date" ]}
    ]
  }
}

The rows have the annotations shown in the following table:

id | core annotations
   | table | number | source number | cells
R1 | T | 1 | 2 | C1.1, C1.2, C1.3, C1.4, C1.5
R2 | T | 2 | 3 | C2.1, C2.2, C2.3, C2.4, C2.5
Note

The source number of each row is offset by one from the number of each row because in the source CSV file, the header line is the first line. It is possible to reconstruct a [RFC7111] compliant reference to the first record in the original CSV file (http://example.org/tree-ops.csv#row=2) using the value of the row's source number. This enables implementations to retain provenance between the table model and the original file.
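
A minimal sketch of that reconstruction is shown below. The function names are hypothetical, and the source-number arithmetic assumes no comment rows and no skipped blank rows precede the row in question.

```python
def source_row_number(number, header_row_count=1, skip_rows=0):
    """Source number of a row, assuming no comment or skipped blank rows."""
    return number + skip_rows + header_row_count


def row_fragment(table_url, source_number):
    """Build an RFC 7111 style row reference from a row's source number."""
    return f"{table_url}#row={source_number}"


first_record = row_fragment(
    "http://example.org/tree-ops.csv",
    source_row_number(1),
)
```

Here the first row of the table (number 1) maps to source number 2, which gives the #row=2 reference mentioned in the note.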

The cells have the annotations shown in the following table (note that the values of all the cells in the table are strings, denoted by the double quotes in the table below):

id | core annotations
   | table | column | row | string value | value
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | "10/18/2010"
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST"
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "6/2/2010" | "6/2/2010"
8.2.1.1 Using Overriding Metadata

The tools that the consumer of this data uses may provide a mechanism for overriding the metadata that has been provided within the file itself. For example, they might enable the consumer to add machine-readable names to the columns, or to mark the fifth column as holding a date in the format M/D/YYYY. These facilities are implementation-defined; the code for invoking a JavaScript-based parser might look like:

Example 16: JavaScript implementation configuration
data.parse({
  "column-names": ["GID", "on_street", "species", "trim_cycle", "inventory_date"],
  "datatypes": ["string", "string", "string", "string", "date"],
  "formats": [null,null,null,null,"M/D/YYYY"]
});

This is equivalent to a metadata file expressed in the syntax defined in [tabular-metadata], looking like:

Example 17: Equivalent metadata syntax
{
  "@type": "Table",
  "url": "http://example.org/tree-ops.csv",
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "datatype": "string"
    }, {
      "name": "on_street",
      "datatype": "string"
    }, {
      "name": "species",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "datatype": {
        "base": "date",
        "format": "M/d/yyyy"
      }
    }]
  }
}

This would be merged with the embedded metadata found in the CSV file, providing the titles for the columns to create:

Example 18: Merged metadata
{
  "@type": "Table",
  "url": "http://example.org/tree-ops.csv",
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": "GID",
      "datatype": "string"
    }, {
      "name": "on_street",
      "titles": "On Street",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "datatype": {
        "base": "date",
        "format": "M/d/yyyy"
      }
    }]
  }
}
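
To make the idea concrete, a deliberately naive positional merge of column descriptions is sketched below. It is not the procedure defined in [tabular-metadata]: the function name is hypothetical, columns are paired purely by index, and properties from the overriding metadata simply win over embedded ones. Unlike Example 18, the embedded titles stay in array form.

```python
def merge_columns(overriding, embedded):
    """Pair column descriptions by position; overriding properties win."""
    merged = []
    for i in range(max(len(overriding), len(embedded))):
        column = {}
        if i < len(embedded):
            column.update(embedded[i])
        if i < len(overriding):
            column.update(overriding[i])
        merged.append(column)
    return merged


embedded = [{"titles": ["GID"]}, {"titles": ["On Street"]}]
overriding = [
    {"name": "GID", "datatype": "string"},
    {"name": "on_street", "datatype": "string"},
]
merged = merge_columns(overriding, embedded)
```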

The processor can then create an annotated tabular data model that includes name annotations on the columns and datatype annotations on the cells, and creates cells whose values are of appropriate types (in the case of this JavaScript implementation, the cells in the last column would be Date objects, for example).

Assuming this kind of implementation-defined parsing, the columns would then have the annotations shown in the following table:

id | core annotations
   | table | number | source number | cells | name | titles | datatype
C1 | T | 1 | 1 | C1.1, C2.1 | GID | GID | string
C2 | T | 2 | 2 | C1.2, C2.2 | on_street | On Street | string
C3 | T | 3 | 3 | C1.3, C2.3 | species | Species | string
C4 | T | 4 | 4 | C1.4, C2.4 | trim_cycle | Trim Cycle | string
C5 | T | 5 | 5 | C1.5, C2.5 | inventory_date | Inventory Date | { "base": "date", "format": "M/d/yyyy" }

The cells have the annotations shown in the following table. Because of the overrides provided by the consumer to guide the parsing, and the way the parser works, the cells in the Inventory Date column (cells C1.5 and C2.5) have values that are parsed dates rather than unparsed strings.

id | core annotations
   | table | column | row | string value | value
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST"
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02
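
How a parser turns the string value into a date is implementation-defined. One possible sketch is shown below; the lookup table that maps the format pattern "M/d/yyyy" to a strptime pattern is a hypothetical shortcut covering only this one pattern, and the function name is likewise an assumption.

```python
from datetime import date, datetime

# Hypothetical lookup: only the single pattern used in this example is mapped.
STRPTIME_FORMATS = {"M/d/yyyy": "%m/%d/%Y"}


def parse_cell(string_value, datatype):
    """Return a date for date-typed cells; other cells keep their string value."""
    if isinstance(datatype, dict) and datatype.get("base") == "date":
        pattern = STRPTIME_FORMATS[datatype["format"]]
        return datetime.strptime(string_value, pattern).date()
    return string_value


inventory_date = {"base": "date", "format": "M/d/yyyy"}
c1_5 = parse_cell("10/18/2010", inventory_date)
c2_5 = parse_cell("6/2/2010", inventory_date)
```

The string value annotation is left untouched; only the value annotation changes type.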
8.2.1.2 Using a Metadata File

A similar set of annotations could be provided through a metadata file, located as discussed in section 5. Locating Metadata and defined in [tabular-metadata]. For example, this might look like:

Example 19: http://example.org/tree-ops.csv-metadata.json
{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "url": "tree-ops.csv",
  "dc:title": "Tree Operations",
  "dcat:keyword": ["tree", "street", "maintenance"],
  "dc:publisher": {
    "schema:name": "Example Municipality",
    "schema:url": {"@id": "http://example.org"}
  },
  "dc:license": {"@id": "http://opendefinition.org/licenses/cc-by/"},
  "dc:modified": {"@value": "2010-12-31", "@type": "xsd:date"},
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": ["GID", "Generic Identifier"],
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true
    }, {
      "name": "on_street",
      "titles": "On Street",
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "dc:description": "The date of the operation that was performed.",
      "datatype": {"base": "date", "format": "M/d/yyyy"}
    }],
    "primaryKey": "GID",
    "aboutUrl": "#gid-{GID}"
  }
}

The annotated tabular data model generated from this would be more sophisticated again. The table itself would have the following annotations:

dc:title
{"@value": "Tree Operations", "@language": "en"}
dcat:keyword
[{"@value": "tree", "@language": "en"}, {"@value": "street", "@language": "en"}, {"@value": "maintenance", "@language": "en"}]
dc:publisher
[{ "schema:name": "Example Municipality", "schema:url": {"@id": "http://example.org"} }]
dc:license
{"@id": "http://opendefinition.org/licenses/cc-by/"}
dc:modified
{"@value": "2010-12-31", "@type": "date"}

The columns would have the annotations shown in the following table:

id | core annotations | other annotations
   | table | number | source number | cells | name | titles | datatype | dc:description
C1 | T | 1 | 1 | C1.1, C2.1 | GID | GID, Generic Identifier | string | An identifier for the operation on a tree.
C2 | T | 2 | 2 | C1.2, C2.2 | on_street | On Street | string | The street that the tree is on.
C3 | T | 3 | 3 | C1.3, C2.3 | species | Species | string | The species of the tree.
C4 | T | 4 | 4 | C1.4, C2.4 | trim_cycle | Trim Cycle | string | The operation performed on the tree.
C5 | T | 5 | 5 | C1.5, C2.5 | inventory_date | Inventory Date | { "base": "date", "format": "M/d/yyyy" } | The date of the operation that was performed.

The rows have an additional primary key annotation, as shown in the following table:

id | core annotations
   | table | number | source number | cells | primary key
R1 | T | 1 | 2 | C1.1, C1.2, C1.3, C1.4, C1.5 | C1.1
R2 | T | 2 | 3 | C2.1, C2.2, C2.3, C2.4, C2.5 | C2.1

Thanks to the provided metadata, the cells have the annotations shown in the following table. The metadata file supplements the model with additional annotations and, for the Inventory Date column (cells C1.5 and C2.5), yields values that are parsed dates rather than unparsed strings.

id | core annotations
   | table | column | row | string value | value | about URL
C1.1 | T | C1 | R1 | "1" | "1" | http://example.org/tree-ops.csv#gid-1
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV" | http://example.org/tree-ops.csv#gid-1
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis" | http://example.org/tree-ops.csv#gid-1
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune" | http://example.org/tree-ops.csv#gid-1
C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18 | http://example.org/tree-ops.csv#gid-1
C2.1 | T | C1 | R2 | "2" | "2" | http://example.org/tree-ops.csv#gid-2
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST" | http://example.org/tree-ops.csv#gid-2
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua" | http://example.org/tree-ops.csv#gid-2
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune" | http://example.org/tree-ops.csv#gid-2
C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02 | http://example.org/tree-ops.csv#gid-2
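
The about URL values in this table follow from expanding the aboutUrl template "#gid-{GID}" for each row and resolving the result against the URL of the table. A simplified sketch is given below; the function name is hypothetical, and the plain string substitution stands in for full URI Template processing, which [tabular-metadata] defines in more detail.

```python
from urllib.parse import urljoin


def about_url(template, row_values, table_url):
    """Expand {name} placeholders, then resolve against the table URL."""
    expanded = template
    for name, value in row_values.items():
        # Simplification: plain substitution instead of URI Template expansion.
        expanded = expanded.replace("{" + name + "}", value)
    return urljoin(table_url, expanded)


table_url = "http://example.org/tree-ops.csv"
row_1 = about_url("#gid-{GID}", {"GID": "1"}, table_url)
row_2 = about_url("#gid-{GID}", {"GID": "2"}, table_url)
```

Every cell in a row shares the same about URL because the template only refers to the GID column of that row.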

8.2.2 Empty and Quoted Cells

The following slightly amended CSV file contains quoted and missing cell values:

Example 20: CSV file containing quoted and missing cell values
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,"Celtis australis","Large Tree Routine Prune",10/18/2010
2,,"Liquidambar styraciflua","Large Tree Routine Prune",

Parsing this file similarly results in an annotated tabular data model of a single table T with five columns and two rows. The columns and rows have exactly the same annotations as previously, but there are two null cell values for C2.2 and C2.5. Note that the quoting of values within the CSV makes no difference to either the string value or value of the cell.

id | core annotations
   | table | column | row | string value | value
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | "10/18/2010"
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "" | null
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "" | null
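
The distinction between string value and value for empty cells can be illustrated with Python's standard csv module, used here as a convenient stand-in for the parsing algorithm of this section. The helper name is hypothetical, and the sketch assumes the empty string is the only marker that maps to null.

```python
import csv
import io

SAMPLE = (
    "GID,On Street,Species,Trim Cycle,Inventory Date\n"
    '1,ADDISON AV,"Celtis australis","Large Tree Routine Prune",10/18/2010\n'
    '2,,"Liquidambar styraciflua","Large Tree Routine Prune",\n'
)


def cell_values(text):
    """Return rows of (string value, value) pairs; empty strings become None."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the single header row
    return [[(s, None if s == "" else s) for s in row] for row in reader]


rows = cell_values(SAMPLE)
```

Quoted and unquoted cells come out identically, and only the two empty cells in the second row receive a null (None) value.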

8.2.3 Tabular Data Embedding Annotations

The following example illustrates some of the complexities that can be involved in parsing tabular data, how the flags described above can be used, and how new tabular data formats could be defined that embed additional annotations into the tabular data model.

In this example, the publishers of the data are using an internal convention to supply additional metadata about the tabular data embedded within the file itself. They are also using a tab as a separator rather than a comma.

Example 21: Tab-separated file containing embedded metadata
#	publisher	City of Palo Alto
#	updated	12/31/2010
#name	GID	on_street	species	trim_cycle	inventory_date
#datatype	string	string	string	string	date:M/D/YYYY
	GID	On Street	Species	Trim Cycle	Inventory Date
	1	ADDISON AV	Celtis australis	Large Tree Routine Prune	10/18/2010
	2	EMERSON ST	Liquidambar styraciflua	Large Tree Routine Prune	6/2/2010
8.2.3.1 Naive Parsing

Naive parsing of the above data will assume a comma separator and thus results in a single table T with a single column and six rows. The column has the annotations shown in the following table:

id | core annotations
   | table | number | source number | cells | titles
C1 | T | 1 | 1 | C1.1, C2.1, C3.1, C4.1, C5.1, C6.1 | # publisher City of Palo Alto

The rows have the annotations shown in the following table:

id | core annotations
   | table | number | source number | cells
R1 | T | 1 | 2 | C1.1
R2 | T | 2 | 3 | C2.1
R3 | T | 3 | 4 | C3.1
R4 | T | 4 | 5 | C4.1
R5 | T | 5 | 6 | C5.1
R6 | T | 6 | 7 | C6.1

The cells have the annotations shown in the following table (note that the values of all the cells in the table are strings, denoted by the double quotes in the table below):

id | core annotations
   | table | column | row | string value | value
C1.1 | T | C1 | R1 | "# updated 12/31/2010" | "# updated 12/31/2010"
C2.1 | T | C1 | R2 | "#name GID on_street species trim_cycle inventory_date" | "#name GID on_street species trim_cycle inventory_date"
C3.1 | T | C1 | R3 | "#datatype string string string string date:M/D/YYYY" | "#datatype string string string string date:M/D/YYYY"
C4.1 | T | C1 | R4 | " GID On Street Species Trim Cycle Inventory Date" | " GID On Street Species Trim Cycle Inventory Date"
C5.1 | T | C1 | R5 | " 1 ADDISON AV Celtis australis Large Tree Routine Prune 10/18/2010" | " 1 ADDISON AV Celtis australis Large Tree Routine Prune 10/18/2010"
C6.1 | T | C1 | R6 | " 2 EMERSON ST Liquidambar styraciflua Large Tree Routine Prune 6/2/2010" | " 2 EMERSON ST Liquidambar styraciflua Large Tree Routine Prune 6/2/2010"
8.2.3.2 Parsing with Flags

The consumer of the data may use the flags described above to create a more useful set of data from this file. Specifically, they could set:

  • delimiter to a tab character
  • skip rows to 4
  • skip columns to 1
  • comment prefix to #

Setting these is done in an implementation-defined way. It could be done, for example, by sniffing the contents of the file itself, through command-line options, or by embedding a dialect description into a metadata file associated with the tabular data, which would look like:

Example 22: Dialect description
{
  "delimiter": "\t",
  "skipRows": 4,
  "skipColumns": 1,
  "commentPrefix": "#"
}

With these flags in operation, parsing this file results in an annotated tabular data model of a single table T with five columns and two rows which is largely the same as that created from the original simple example described in section 8.2.1 Simple Example. There are three differences.

First, because the four skipped rows began with the comment prefix, the table itself now has four rdfs:comment annotations, with the values:

  1. publisher City of Palo Alto
  2. updated 12/31/2010
  3. name GID on_street species trim_cycle inventory_date
  4. datatype string string string string date:M/D/YYYY

Second, because the first column has been skipped, the source number of each of the columns is offset by one from the number of each column:

id | core annotations
   | table | number | source number | cells | titles
C1 | T | 1 | 2 | C1.1, C2.1 | GID
C2 | T | 2 | 3 | C1.2, C2.2 | On Street
C3 | T | 3 | 4 | C1.3, C2.3 | Species
C4 | T | 4 | 5 | C1.4, C2.4 | Trim Cycle
C5 | T | 5 | 6 | C1.5, C2.5 | Inventory Date

Finally, because four additional rows have been skipped, the source number of each of the rows is offset by five from the row number (the four skipped rows plus the single header row):

id | core annotations
   | table | number | source number | cells
R1 | T | 1 | 6 | C1.1, C1.2, C1.3, C1.4, C1.5
R2 | T | 2 | 7 | C2.1, C2.2, C2.3, C2.4, C2.5
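
The effect of these flags on comments, titles, and source numbers can be traced with a simplified sketch. The function name and return shape are hypothetical, and quoting, escaping, trimming, and blank-row handling are deliberately left out, so this is an illustration of the numbering only, not a full implementation of the parsing algorithm.

```python
SAMPLE = (
    "#\tpublisher\tCity of Palo Alto\n"
    "#\tupdated\t12/31/2010\n"
    "#name\tGID\ton_street\tspecies\ttrim_cycle\tinventory_date\n"
    "#datatype\tstring\tstring\tstring\tstring\tdate:M/D/YYYY\n"
    "\tGID\tOn Street\tSpecies\tTrim Cycle\tInventory Date\n"
    "\t1\tADDISON AV\tCeltis australis\tLarge Tree Routine Prune\t10/18/2010\n"
    "\t2\tEMERSON ST\tLiquidambar styraciflua\tLarge Tree Routine Prune\t6/2/2010\n"
)


def parse_with_flags(text, delimiter="\t", skip_rows=4, skip_columns=1,
                     comment_prefix="#", header_row_count=1):
    """Return (comments, columns, rows) with numbers and source numbers."""
    lines = iter(text.splitlines())
    comments, titles, rows = [], [], []
    source_row = 1

    def is_comment(line):
        return comment_prefix is not None and line.startswith(comment_prefix)

    for _ in range(skip_rows):
        line = next(lines)
        if is_comment(line):
            comments.append(line[len(comment_prefix):])
        elif line != "":
            comments.append(line)
        source_row += 1
    for _ in range(header_row_count):
        line = next(lines)
        if is_comment(line):
            comments.append(line[len(comment_prefix):])
        else:
            for i, value in enumerate(line.split(delimiter)[skip_columns:]):
                if value.strip() == "":
                    continue
                while len(titles) <= i:
                    titles.append([])
                titles[i].append(value)
        source_row += 1
    number = 1
    for line in lines:
        if is_comment(line):
            comments.append(line[len(comment_prefix):])
        else:
            cells = line.split(delimiter)[skip_columns:]
            rows.append({"number": number, "source_number": source_row, "cells": cells})
            number += 1
        source_row += 1
    columns = [
        {"number": i + 1, "source_number": i + 1 + skip_columns, "titles": t}
        for i, t in enumerate(titles)
    ]
    return comments, columns, rows


comments, columns, rows = parse_with_flags(SAMPLE)
```

In this sketch only the comment prefix is stripped from skipped rows, so the comment strings keep their tab characters.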
8.2.3.3 Recognizing Tabular Data Formats

The conventions used in this data (invented for the purpose of this example) are in fact intended to create an annotated tabular data model which includes named annotations on the table itself, on the columns, and on the cells. The creator of these conventions could create a specification for this particular tabular data syntax and register a media type for it. The specification would include statements like:

  • A tab delimiter is always used.
  • The first column is always ignored.
  • When the first column of a row has the value "#", the second column is the name of an annotation on the table and the values of the remaining columns are concatenated to create the value of that annotation.
  • When the first column of a row has the value #name, the remaining cells in the row provide a name annotation for each column in the table.
  • When the first column of a row has the value #datatype, the remaining cells in the row provide datatype/format annotations for the cells within the relevant column, and these are interpreted to create the value for each cell in that column.
  • The first row where the first column is empty is a row of headers; these provide title annotations on the columns in the table.
  • The remaining rows make up the data of the table.
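
A parser for this invented convention might look roughly like the sketch below. Everything here is hypothetical (the function name, the returned dictionary, and the choice to concatenate the remaining columns without a separator), and the datatype strings are returned uninterpreted rather than used to convert cell values.

```python
def parse_annotated_tsv(text):
    """Split the invented format into table annotations, column info, and rows."""
    annotations, names, datatypes, titles, rows = {}, [], [], [], []
    for line in text.splitlines():
        first, *rest = line.split("\t")
        if first == "#":
            # Second column names the annotation; the rest are concatenated.
            annotations[rest[0]] = "".join(rest[1:])
        elif first == "#name":
            names = rest
        elif first == "#datatype":
            datatypes = rest
        elif first == "" and not titles:
            titles = rest  # first row with an empty first column: headers
        else:
            rows.append(rest)  # remaining rows are data; first column ignored
    return {
        "annotations": annotations,
        "names": names,
        "datatypes": datatypes,
        "titles": titles,
        "rows": rows,
    }


SAMPLE = (
    "#\tpublisher\tCity of Palo Alto\n"
    "#\tupdated\t12/31/2010\n"
    "#name\tGID\ton_street\tspecies\ttrim_cycle\tinventory_date\n"
    "#datatype\tstring\tstring\tstring\tstring\tdate:M/D/YYYY\n"
    "\tGID\tOn Street\tSpecies\tTrim Cycle\tInventory Date\n"
    "\t1\tADDISON AV\tCeltis australis\tLarge Tree Routine Prune\t10/18/2010\n"
    "\t2\tEMERSON ST\tLiquidambar styraciflua\tLarge Tree Routine Prune\t6/2/2010\n"
)
parsed = parse_annotated_tsv(SAMPLE)
```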

Parsers that recognized the format could then build a more sophisticated annotated tabular data model using only the embedded information in the tabular data file. They would extract embedded metadata looking like:

Example 23: Embedded metadata in the format of the annotated tabular model
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "tree-ops.csv",
  "dc:publisher": "City of Palo Alto",
  "dc:updated": "12/31/2010",
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": "GID",
      "datatype": "string"
    }, {
      "name": "on_street",
      "titles": "On Street",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "datatype": {
        "base": "date",
        "format": "M/d/yyyy"
      }
    }]
  }
}

As before, the result would be a single table T with five columns and two rows. The table itself would have two annotations:

dc:publisher
{"@value": "City of Palo Alto"}
dc:updated
{"@value": "12/31/2010"}

The columns have the annotations shown in the following table:

id | core annotations
   | table | number | source number | cells | name | titles
C1 | T | 1 | 2 | C1.1, C2.1 | GID | GID
C2 | T | 2 | 3 | C1.2, C2.2 | on_street | On Street
C3 | T | 3 | 4 | C1.3, C2.3 | species | Species
C4 | T | 4 | 5 | C1.4, C2.4 | trim_cycle | Trim Cycle
C5 | T | 5 | 6 | C1.5, C2.5 | inventory_date | Inventory Date

The rows have the annotations shown in the following table, exactly as in previous examples:

id | core annotations
   | table | number | source number | cells
R1 | T | 1 | 6 | C1.1, C1.2, C1.3, C1.4, C1.5
R2 | T | 2 | 7 | C2.1, C2.2, C2.3, C2.4, C2.5

The cells have the annotations shown in the following table. Because of the way the particular tabular data format has been specified, the model includes additional annotations and, for the Inventory Date column (cells C1.5 and C2.5), the cells have values that are parsed dates rather than unparsed strings.

id | core annotations
   | table | column | row | string value | value
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST"
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02

8.2.4 Parsing Multiple Header Lines

The following example shows a CSV file with multiple header lines:

Example 24: CSV file with multiple header lines
Who,What,,Where,
Organization,Sector,Subsector,Department,Municipality
#org,#sector,#subsector,#adm1,#adm2
UNICEF,Education,Teacher training,Chocó,Quidbó
UNICEF,Education,Teacher training,Chocó,Bojayá

Here, the first line contains some grouping titles, which are not particularly helpful. The two lines following it contain useful titles for the columns. Thus the appropriate configuration for a dialect description is:

Example 25: Dialect description for multiple header lines
{
  "skipRows": 1,
  "headerRowCount": 2
}

With this configuration, the table model contains five columns, each of which has two titles, summarized in the following table:

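All columns other than id are core annotations.

| id | table | number | source number | cells | titles |
|----|-------|--------|---------------|-------|--------|
| C1 | T | 1 | 1 | C1.1, C2.1 | Organization, #org |
| C2 | T | 2 | 2 | C1.2, C2.2 | Sector, #sector |
| C3 | T | 3 | 3 | C1.3, C2.3 | Subsector, #subsector |
| C4 | T | 4 | 4 | C1.4, C2.4 | Department, #adm1 |
| C5 | T | 5 | 5 | C1.5, C2.5 | Municipality, #adm2 |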

As metadata, this would look like:

Example 26: Extracted metadata
{
  "tableSchema": {
    "columns": [
      { "titles": ["Organization", "#org"] },
      { "titles": ["Sector", "#sector"] },
      { "titles": ["Subsector", "#subsector"] },
      { "titles": ["Department", "#adm1"] },
      { "titles": ["Municipality", "#adm2"] }
    ]
  }
}
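How the skipped row and the two header rows lead to these titles can be sketched in Python. This is an illustration rather than the normative parsing algorithm: `extract_titles` is a hypothetical helper, it relies on Python's standard `csv` module for splitting rows, and it assumes that no comment prefix is in effect, so the `#org` line is read as a header row.

```python
import csv
import io

def extract_titles(text, skip_rows=0, header_row_count=1):
    """Return (columns, data_rows) for CSV text (hypothetical helper).

    columns holds one {"titles": [...]} dict per column, collected from the
    header rows that follow the skipped rows. Empty header cells add no title.
    """
    rows = list(csv.reader(io.StringIO(text)))
    headers = rows[skip_rows:skip_rows + header_row_count]
    data_rows = rows[skip_rows + header_row_count:]
    width = max((len(row) for row in headers), default=0)
    columns = []
    for index in range(width):
        titles = [row[index] for row in headers
                  if index < len(row) and row[index] != ""]
        columns.append({"titles": titles})
    return columns, data_rows

CSV_TEXT = (
    "Who,What,,Where,\n"
    "Organization,Sector,Subsector,Department,Municipality\n"
    "#org,#sector,#subsector,#adm1,#adm2\n"
    "UNICEF,Education,Teacher training,Chocó,Quidbó\n"
    "UNICEF,Education,Teacher training,Chocó,Bojayá\n"
)

columns, data_rows = extract_titles(CSV_TEXT, skip_rows=1, header_row_count=2)
print({"tableSchema": {"columns": columns}})
print(len(data_rows), "data rows")
```

With `skip_rows=1` and `header_row_count=2`, the grouping line is discarded and each of the five columns receives two titles, leaving two data rows.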

A separate metadata file could contain just the second of each of these titles, for example:

Example 27: Metadata file
{
  "tableSchema": {
    "columns": [
      { "name": "org", "titles": "#org" },
      { "name": "sector", "titles": "#sector" },
      { "name": "subsector", "titles": "#subsector" },
      { "name": "adm1", "titles": "#adm1" },
      { "name": "adm2", "titles": "#adm2" }
    ]
  }
}

This enables people from multiple jurisdictions to use the same tabular data structures without having to use exactly the same titles within their documents.
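The idea can be illustrated with a simplified check that a column's titles from the file and from the metadata share at least one entry. The sketch below is an assumption-laden simplification: `titles_overlap` is a hypothetical helper, language tags and column names are ignored, and the Spanish header row is invented purely for illustration.

```python
def titles_overlap(embedded_titles, metadata_titles):
    """True when the two title collections share at least one title.

    metadata_titles may be a single string or a list of strings.
    """
    if isinstance(metadata_titles, str):
        metadata_titles = [metadata_titles]
    return bool(set(embedded_titles) & set(metadata_titles))

# Titles taken from the file in Example 24.
english = ["Organization", "#org"]
# Hypothetical titles from a file published in another language.
spanish = ["Organización", "#org"]

print(titles_overlap(english, "#org"))
print(titles_overlap(spanish, "#org"))
print(titles_overlap(["Organización"], "#org"))
```

Both files match the same metadata through the shared `#org` title, while a file that carries only the localized title does not.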

A. IANA Considerations

/.well-known/csvm
URI suffix:
csvm
Change controller:
W3C
Specification document(s):
This document, section 5.3 Default Locations and Site-wide Location Configuration
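As an illustration of where this well-known resource lives relative to a tabular data file, the following Python sketch resolves the path against the file's URL. The function name and the example URL are assumptions made for this sketch; what the resource contains and how it is used are defined in section 5.3, not here.

```python
from urllib.parse import urljoin

def site_wide_config_url(csv_url: str) -> str:
    """Resolve /.well-known/csvm against the origin of csv_url."""
    return urljoin(csv_url, "/.well-known/csvm")

# Hypothetical location of a CSV file.
print(site_wide_config_url("http://example.org/data/trees.csv"))
```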

B. Existing Standards

This section is non-normative.

This appendix outlines various ways in which CSV is defined.

B.1 RFC 4180

[RFC4180] defines CSV with the following ABNF grammar:

file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D
DQUOTE =  %x22
LF = %x0A
CRLF = CR LF
TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E
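The field and record rules above can be sketched as a small Python parser. This is a minimal illustration, not a full RFC 4180 implementation: `parse_record` is a hypothetical helper that handles a single record (no splitting of a file into records), and the regular expressions are a direct transcription of the TEXTDATA, escaped and non-escaped productions.

```python
import re

# TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
TEXTDATA = r"[\x20-\x21\x23-\x2B\x2D-\x7E]"
# escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
ESCAPED = rf'"(?:{TEXTDATA}|,|\r|\n|"")*"'
# non-escaped = *TEXTDATA
NON_ESCAPED = rf"{TEXTDATA}*"
FIELD = re.compile(rf"(?:{ESCAPED}|{NON_ESCAPED})")

def parse_record(record: str) -> list:
    """Split one record into field values according to the grammar."""
    fields = []
    position = 0
    while True:
        # FIELD always matches, because non-escaped may be empty.
        match = FIELD.match(record, position)
        raw = match.group(0)
        if raw.startswith('"'):
            # Drop the enclosing quotes and undo the 2DQUOTE escaping.
            raw = raw[1:-1].replace('""', '"')
        fields.append(raw)
        position = match.end()
        if position == len(record):
            return fields
        if record[position] != ",":
            raise ValueError(f"unexpected character at offset {position}")
        position += 1

print(parse_record('1,"ADDISON AV","say ""hi"", ok",'))
```

Because TEXTDATA covers only printable ASCII, this sketch rejects a field such as `Chocó` with a `ValueError`.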
        

Of particular note here are:

B.2 Excel

Excel is a common tool for both creating and reading CSV documents, and therefore the CSV that it produces is a de facto standard.

Note

The following describes the behavior of Microsoft Excel for Mac 2011 with an English locale. Further testing is needed to see the behavior of Excel in other situations.

B.2.1 Saved CSV

Excel generates CSV files encoded using Windows-1252 with LF line endings. Characters that cannot be represented within Windows-1252 are replaced by underscores. Only those cells that need escaping (e.g. because they contain commas or double quotes) are escaped, and double quotes are escaped with two double quotes.
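The behavior described above can be imitated in Python to produce comparable test files. This sketch does not call Excel and makes no claim about its internals; `save_like_described` and the `underscore` error handler are hypothetical names, and the sample row is invented for illustration.

```python
import codecs
import csv
import io

def _underscore(error):
    """Replace each character that cannot be encoded with an underscore."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    return "_" * (error.end - error.start), error.end

codecs.register_error("underscore", _underscore)

def save_like_described(rows) -> bytes:
    """Serialize rows with minimal quoting, LF endings and Windows-1252."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue().encode("cp1252", errors="underscore")

# Hypothetical row: an accented character, embedded quotes, a comma,
# and two characters that Windows-1252 cannot represent.
print(save_like_described([["Chocó", 'say "hi"', "a,b", "日本"]]))
```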

Dates and numbers are formatted as displayed, which means that formatting can lead to information being lost or becoming inconsistent.

B.2.2 Opened CSV

When opening CSV files, Excel interprets CSV files saved in UTF-8 as being encoded as Windows-1252 (whether or not a BOM is present). It correctly deals with double quoted cells, except that it converts line breaks within cells into spaces. It understands CRLF as a line break. It detects dates (formatted as YYYY-MM-DD) and formats them in the default date formatting for files.
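The effect of reading UTF-8 bytes as Windows-1252 can be reproduced in Python. The sketch below only demonstrates the encoding mismatch itself; `misread_utf8_as_cp1252` is a hypothetical helper and the sample text is taken from Example 24.

```python
def misread_utf8_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as Windows-1252.

    errors="replace" is used because a few byte values have no
    Windows-1252 mapping in Python's codec.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")

print(misread_utf8_as_cp1252("Chocó,Bojayá"))
```

Each accented character occupies two bytes in UTF-8, so it appears as two unrelated Windows-1252 characters after the misreading.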

B.2.3 Imported CSV

Excel provides more control when importing CSV files into Excel. However, it does not properly understand UTF-8 (with or without BOM). It does however properly understand UTF-16 and can read non-ASCII characters from a UTF-16-encoded file.

A particular quirk in the importing of CSV is that if a cell contains a line break, the final double quote that escapes the cell will be included within it.

B.2.4 Copied Tabular Data

When tabular data is copied from Excel, it is copied in a tab-delimited format, with LF line breaks.

B.3 Google Spreadsheets

B.3.1 Downloading CSV

Downloaded CSV files are encoded in UTF-8, without a BOM, and with LF line endings. Dates and numbers are formatted as they appear within the spreadsheet.

B.3.2 Importing CSV

CSV files can be imported as UTF-8 (with or without BOM). CRLF line endings are correctly recognized. Dates are reformatted to the default date format on load.

B.4 CSV Files in a Tabular Data Package

Tabular Data Packages place the following restrictions on CSV files:

As a starting point, CSV files included in a Tabular Data Package must conform to the RFC for CSV (4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files). In addition:

  • File names MUST end with .csv.

  • Files MUST be encoded as UTF-8.

  • Files MUST have a single header row. This row MUST be the first row in the file.

    • Terminology: each column in the CSV file is termed a field and its name is the string in that column in the header row.

    • The name MUST be unique amongst fields, MUST contain at least one character, and MUST conform to the character restrictions defined for the name property.

  • Rows in the file MUST NOT contain more fields than are in the header row (though they may contain less).

  • Each file MUST have an entry in the tables array in the datapackage.json file.

  • The resource metadata MUST include a tableSchema attribute whose value MUST be a valid schema description.

  • All fields in the CSV files MUST be described in the schema description.
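Some of the restrictions above can be checked from the file alone, as the following Python sketch shows. It is a partial illustration under stated assumptions: `check_package_csv` is a hypothetical helper, it relies on Python's `csv` module for row splitting, and it does not check the character restrictions on names or anything that depends on the datapackage.json descriptor or the schema description.

```python
import csv
import io

def check_package_csv(file_name: str, raw: bytes) -> list:
    """Return a list of violations found in one CSV file (partial check)."""
    problems = []
    if not file_name.endswith(".csv"):
        problems.append("file name does not end with .csv")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return problems + ["file has no header row"]
    header = rows[0]
    if any(name == "" for name in header):
        problems.append("a field name is empty")
    if len(set(header)) != len(header):
        problems.append("field names are not unique")
    # Record numbers count CSV records, with the header as record 1.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) > len(header):
            problems.append(f"row {number} has more fields than the header")
    return problems

print(check_package_csv("trees.csv", b"GID,On Street\n1,ADDISON AV\n"))
print(check_package_csv("trees.txt", b"a,a\n1,2,3\n"))
```

An empty list means that none of the checked restrictions was violated; it does not establish that the file satisfies the remaining ones.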

CSV files generated by different applications often vary in their syntax, e.g. use of quoting characters, delimiters, etc. To encourage conformance, CSV files in a Tabular Data Package SHOULD:

  • Use "," as field delimiters.
  • Use CRLF (U+000D U+000A) or LF (U+000A) as line terminators.

If a CSV file does not follow these rules then its specific CSV dialect MUST be documented. The resource hash for the resource in the datapackage.json descriptor MUST:

Applications processing the CSV file SHOULD use the dialect of the CSV file to guide parsing.

Note

To replicate the findings above, test files which include non-ASCII characters, double quotes, and line breaks within cells are:

C. Acknowledgements

This section is non-normative.

At the time of publication, the following individuals had participated in the Working Group, in the order of their first name: Adam Retter, Alf Eaton, Anastasia Dimou, Andy Seaborne, Axel Polleres, Christopher Gutteridge, Dan Brickley, Davide Ceolin, Eric Stephan, Erik Mannens, Gregg Kellogg, Ivan Herman, Jeni Tennison, Jeremy Tandy, Jürgen Umbrich, Rufus Pollock, Stasinos Konstantopoulos, William Ingram, and Yakov Shafranovich.

D. Changes from previous drafts

D.1 Changes since the candidate recommendation of 16 July 2015

D.2 Changes since the working draft of 16 April 2015

D.3 Changes since the working draft of 08 January 2015

The document has undergone substantial changes since the last working draft. Below are some of the changes made:

E. References

E.1 Normative references

[BCP47]
A. Phillips; M. Davis. Tags for Identifying Languages. September 2009. IETF Best Current Practice. URL: https://tools.ietf.org/html/bcp47
[BIDI]
Mark Davis; Aharon Lanin; Andrew Glass. Unicode Bidirectional Algorithm. 5 June 2014. Unicode Standard Annex #9. URL: http://www.unicode.org/reports/tr9/
[ECMASCRIPT]
ECMAScript Language Specification. URL: https://tc39.github.io/ecma262/
[ISO8601]
Representation of dates and times. International Organization for Standardization. 2004. ISO 8601:2004. URL: http://www.iso.org/iso/catalogue_detail?csnumber=40874
[JSON-LD]
Manu Sporny; Gregg Kellogg; Markus Lanthaler. JSON-LD 1.0. 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/json-ld/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[RFC3968]
G. Camarillo. The Internet Assigned Number Authority (IANA) Header Field Parameter Registry for the Session Initiation Protocol (SIP). December 2004. Best Current Practice. URL: https://tools.ietf.org/html/rfc3968
[RFC4180]
Y. Shafranovich. Common Format and MIME Type for Comma-Separated Values (CSV) Files. October 2005. Informational. URL: https://tools.ietf.org/html/rfc4180
[RFC5785]
M. Nottingham; E. Hammer-Lahav. Defining Well-Known Uniform Resource Identifiers (URIs). April 2010. Proposed Standard. URL: https://tools.ietf.org/html/rfc5785
[UAX35]
Mark Davis; CLDR committee members. Unicode Locale Data Markup Language (LDML). 15 March 2013. Unicode Standard Annex #35. URL: http://www.unicode.org/reports/tr35/tr35-31/tr35.html
[UNICODE]
The Unicode Standard. URL: http://www.unicode.org/versions/latest/
[URI-TEMPLATE]
J. Gregorio; R. Fielding; M. Hadley; M. Nottingham; D. Orchard. URI Template. March 2012. Proposed Standard. URL: https://tools.ietf.org/html/rfc6570
[tabular-metadata]
Jeni Tennison; Gregg Kellogg. Metadata Vocabulary for Tabular Data. W3C Recommendation. URL: http://www.w3.org/TR/2015/REC-tabular-metadata-20151217/
[xmlschema11-2]
David Peterson; Sandy Gao; Ashok Malhotra; Michael Sperberg-McQueen; Henry Thompson; Paul V. Biron et al. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. 5 April 2012. W3C Recommendation. URL: http://www.w3.org/TR/xmlschema11-2/

E.2 Informative references

[EBNF-NOTATION]
Tim Bray; Jean Paoli; C. Michael Sperberg-McQueen; Eve Maler; François Yergeau. EBNF Notation. W3C Recommendation. URL: http://www.w3.org/TR/xml/#sec-notation
[RFC7111]
M. Hausenblas; E. Wilde; J. Tennison. URI Fragment Identifiers for the text/csv Media Type. January 2014. Informational. URL: https://tools.ietf.org/html/rfc7111
[UAX15]
Mark Davis; Ken Whistler. Unicode Normalization Forms. 31 August 2012. Unicode Standard Annex #15. URL: http://www.unicode.org/reports/tr15
[annotation-model]
Robert Sanderson; Paolo Ciccarese; Benjamin Young. Web Annotation Data Model. 15 October 2015. W3C Working Draft. URL: http://www.w3.org/TR/annotation-model/
[csv2json]
Jeremy Tandy; Ivan Herman. Generating JSON from Tabular Data on the Web. W3C Recommendation. URL: http://www.w3.org/TR/2015/REC-csv2json-20151217/
[csv2rdf]
Jeremy Tandy; Ivan Herman; Gregg Kellogg. Generating RDF from Tabular Data on the Web. W3C Recommendation. URL: http://www.w3.org/TR/2015/REC-csv2rdf-20151217/
[encoding]
Anne van Kesteren; Joshua Bell; Addison Phillips. Encoding. 20 October 2015. W3C Candidate Recommendation. URL: http://www.w3.org/TR/encoding/
[vocab-data-cube]
Richard Cyganiak; Dave Reynolds. The RDF Data Cube Vocabulary. 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/vocab-data-cube/