CSV on the Web: Use Cases and Requirements

W3C Working Group Note

Jeremy Tandy, Met Office
Davide Ceolin, VU University Amsterdam
Eric Stephan, Pacific Northwest National Laboratory

This document is also available in this non-normative format: ePub


A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. The CSV on the Web Working Group aims to specify technologies that provide greater interoperability for data-dependent applications on the Web when working with tabular datasets comprising single or multiple files using CSV, or similar, formats.

This document lists the use cases compiled by the Working Group that are considered representative of how tabular data is commonly used within data-dependent applications. The use cases observe existing common practice undertaken when working with tabular data, often illustrating shortcomings or limitations of existing formats or technologies. This document also provides a set of requirements derived from these use cases that have been used to guide the specification design.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is a draft document which may be merged into another document or eventually make its way into being a standalone Working Draft.

This document was published by the CSV on the Web Working Group as a Working Group Note. If you wish to make comments regarding this document, please send them to public-csv-wg@w3.org (subscribe, archives). All comments are welcome.

Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 September 2015 W3C Process Document.


1. Introduction

A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. CSV files may be of a significant size but they can be generated and manipulated easily, and there is a significant body of software available to handle them. Indeed, popular spreadsheet applications (Microsoft Excel, iWork’s Numbers, or OpenOffice.org) as well as numerous other applications can produce and consume these files. However, although these tools make conversion to CSV easy, it is resisted by some publishers because CSV is a much less rich format that can't express important detail that the publishers want to express, such as annotations, the meaning of identifier codes etc.

Existing formats for tabular data are format-oriented and hard to process (e.g. Excel); un-extensible (e.g. CSV/TSV); or they assume the use of particular technologies (e.g. SQL dumps). None of these formats allow developers to pull in multiple data sets, manipulate, visualize and combine them in flexible ways. Other information relevant to these datasets, such as access rights and provenance, is not easy to find. CSV is a very useful and simple format, but to unlock the data and make it portable to environments other than the one in which it was created, there needs to be a means of encoding and associating relevant metadata.

To address these issues, the CSV on the Web Working Group seeks to provide:

In order to determine the scope of and elicit the requirements for this extended CSV format (CSV+), a set of use cases has been compiled. Each use case provides a narrative describing how a representative user works with tabular data to achieve their goal, supported, where possible, with example datasets. The use cases observe existing common practice undertaken when working with tabular data, often illustrating shortcomings or limitations of existing formats or technologies. It is anticipated that the additional metadata provided within the CSV+ format, when coupled with metadata-aware tools, will simplify how users work with tabular data. As a result, the use cases seek to identify where user effort may be reduced.

A set of requirements, used to guide the development of the CSV+ specification, has been derived from the compiled use cases.

2. Use Cases

The use cases below describe many applications of tabular data. Whilst there are many different variations of tabular data, all the examples conform to the definition of tabular data given in the Model for Tabular Data and Metadata on the Web [tabular-data-model]:

Tabular data is data that is structured into rows, each of which contains information about some thing. Each row contains the same number of fields (although some of these fields may be empty), which provide values of properties of the thing described by the row. In tabular data, fields within the same column provide values for the same property of the thing described by the particular row.

In selecting the use cases we have reviewed a number of row-oriented data formats that, at first glance, appear to be tabular data. However, closer inspection indicates that one or other of the characteristics of tabular data was not present. For example, the HL7 format, from the health informatics domain, defines a separate schema for each row (known as a "segment" in that format), which means that HL7 messages do not have a regular number of columns for each row.

2.1 Use Case #1 - Digital preservation of government records

(Contributed by Adam Retter; supplemental information about use of XML provided by Liam Quin)

The laws of England and Wales place obligations upon departments and The National Archives (TNA) for the collection, disposal and preservation of records. Government departments are obliged under the Public Records Act 1958 sections 3, 4 and 5 to select, transfer, preserve and make available those records that have been defined as public records. These obligations apply to records in all formats and media, including paper and digital records. Details concerning the selection and transfer of records can be found here.

Departments transferring records to TNA must catalogue or list the selected records according to The National Archives' defined cataloguing principles and standards. Cataloguing is the process of writing a description, or Transcriptions of Records, for the records being transferred. Once each Transcription of Records is added to the Records Catalogue, records can be subsequently discovered and accessed using the supplied descriptions and titles.

TNA specifies what information should be provided within a Transcriptions of Records and how that information should be formatted. A number of formats and syntaxes are supported, including RDF. However, the predominant format used for the exchange of Transcriptions of Records is CSV, as the government departments providing the Records lack either the technology or resources to provide metadata in the XML and RDF formats preferred by TNA.

A CSV-encoded Transcriptions of Records typically describes a set of Records, often organised within a hierarchy. As a result, it is necessary to describe the interrelationships between Records within a single CSV file.

Each row within a CSV file relates to a particular Record and is allocated a unique identifier. This unique identifier behaves as a primary key for the Record within the scope of the CSV file and is used when referencing that Record from within other Record transcriptions. The identifier is unique only within the scope of the datafile; in order for the Record to be referenced from outside this datafile, the local identifier must be mapped to a globally unique identifier such as a URI.

Requires: PrimaryKey, URIMapping and ForeignKeyReferences.
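
The identifier handling described above can be sketched in a few lines of Python. This is only an illustration: the base URI `http://example.org/records/`, the column names `identifier` and `parent`, and the sample rows are all hypothetical and are not taken from TNA's data definitions.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical base URI; a real publisher would choose its own namespace.
BASE_URI = "http://example.org/records/"

def to_uri(local_id, base=BASE_URI):
    """Map an identifier that is unique within one file to a global URI."""
    return base + quote(local_id.strip(), safe="")

def index_records(csv_text, key="identifier", parent="parent"):
    """Return {uri: row}, checking key uniqueness and parent references."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    keys = [row[key] for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError("primary key values are not unique")
    known = set(keys)
    for row in rows:
        # An empty parent cell means the Record sits at the top of the hierarchy.
        if row[parent] and row[parent] not in known:
            raise ValueError(f"unresolved reference: {row[parent]}")
    return {to_uri(row[key]): row for row in rows}

# Hypothetical two-row file: one series and one piece that refers to it.
SAMPLE = "identifier,parent,title\nA1,,Series\nA1/1,A1,Piece one\n"
records = index_records(SAMPLE)
```

Percent-encoding the local identifier keeps characters such as `/` from being read as URI path separators; whether that is desirable depends on the publisher's URI design.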

Upon receipt by TNA, each of the Transcriptions of Records is validated against the (set of) centrally published data definition(s); it is essential that received CSV metadata comply with these specifications to ensure efficient and error-free ingest into the Records Catalogue.

The validation applied is dependent on the type of entity described in each row. Entity type is specified in a specific column (e.g. type).

The data definition file, or CSV Schema, used by the CSV Validation Tool effectively forms the basis of a formal contract between TNA and supplying organisations. For more information on the CSV Validation Tool and CSV Schema developed by TNA please refer to the online documentation.

The CSV Validation Tool is written in Scala version 2.10.

Requires: WellFormedCsvCheck and CsvValidation.

Following validation, the CSV-encoded Transcriptions of Records are transformed into RDF for insertion into the triple store that underpins the Records Catalogue. The CSV is initially transformed into an interim XML format using XSLT and then processed further using a mix of XSLT, Java and Scala to create RDF/XML. The CSV files do not include all the information required to undertake the transformation, e.g. defining which RDF properties are to be used when creating triples for the data value in each cell. As a result, bespoke software has been created by TNA to supply the necessary additional information during the CSV to RDF transformation process. The availability of generic mechanisms to transform CSV to RDF would reduce the burden of effort within TNA when working with CSV files.

Requires: SyntacticTypeDefinition, SemanticTypeDefinition and CsvToRdfTransformation.
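
To illustrate the kind of additional information that the CSV file alone does not carry, the sketch below supplies a column-to-property mapping from outside the data and emits N-Triples. It is not TNA's pipeline (which goes via XML and RDF/XML); the base URI, the `identifier` column and the choice of the Dublin Core `title` property are assumptions made for the example.

```python
import csv
import io

# Assumed mapping from column name to RDF property; in practice this is the
# metadata that has to be supplied alongside the CSV file.
PROPERTIES = {"title": "http://purl.org/dc/terms/title"}
BASE = "http://example.org/records/"  # hypothetical namespace

def csv_to_ntriples(csv_text, key="identifier"):
    """Emit one N-Triples line per mapped, non-empty cell."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row[key]}>"
        for column, prop in PROPERTIES.items():
            if row.get(column):
                # Escape backslashes and quotes for an N-Triples string literal.
                literal = row[column].replace("\\", "\\\\").replace('"', '\\"')
                lines.append(f'{subject} <{prop}> "{literal}" .')
    return lines

triples = csv_to_ntriples("identifier,title\nA1,Series\n")
```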

In this particular case, RDF is the target format for the conversion of the CSV-encoded Transcriptions of Records. However, the conversion of CSV to XML (in this case used as an interim conversion step) is illustrative of a common data conversion workflow.

The transformation outlined above is typical of common practice in that it uses a freely-available XSLT transformation or XQuery parser (in this case Andrew Welch's CSV to XML converter in XSLT 2.0) which is then modified to meet the specific usage requirements.

The resulting XML document can then be further transformed using XSLT to create XHTML documentation, perhaps including charts such as histograms to present summary data.

Requires: CsvToXmlTransformation.

2.2 Use Case #2 - Publication of National Statistics

(Contributed by Jeni Tennison)

The Office for National Statistics (ONS) is the UK’s largest independent producer of official statistics and is the recognised national statistical institute for the UK. It is responsible for collecting and publishing statistics related to the economy, population and society at national, regional and local levels.

Sets of statistics are typically grouped together into datasets comprising collections of related tabular data. Within its underlying information systems, ONS maintains a clear separation between the statistical data itself and the metadata required for interpretation. ONS classifies the metadata into two categories:

These datasets are published online in both CSV format and as Microsoft Excel Workbooks that have been manually assembled from the underlying data.

For example, dataset QS601EW Economic activity, derived from the 2011 Census, is available as a precompiled Microsoft Excel Workbook for several sets of administrative geographies, e.g. 2011 Census: QS601EW Economic activity, local authorities in England and Wales, and in CSV form via the ONS Data Explorer.

The ONS Data Explorer presents the user with a list of available datasets. A user may choose to browse through the entire list or filter that list by topic. To enable the user to determine whether or not a dataset meets their need, summary information is available for each dataset.

QS601EW Economic activity provides the following summary information:

Requires: AnnotationAndSupplementaryInfo.

Once the required dataset has been selected, the user is prompted to choose how they would like the statistical data to be aggregated. In the case of QS601EW Economic activity, the user is required to choose between the two mutually exclusive geography types: 2011 Administrative Hierarchy and 2011 Westminster Parliamentary Constituency Hierarchy. Effectively, the QS601EW Economic activity dataset is partitioned into two separate tables for publication.

Requires: GroupingOfMultipleTables.

The user is also provided with an option to sub-select only the elements of the dataset that they deem pertinent for their needs. In the case of QS601EW Economic activity the user may select data from up to 200 geographic areas within the dataset to create a data subset that meets their needs. The data subset may be viewed online (presented as an HTML table) or downloaded in CSV or Microsoft Excel formats.

Requires: CsvAsSubsetOfLargerDataset.

An example extract of data for England and Wales in CSV form is provided below. The data subset is provided as a compressed file containing both a CSV formatted data file and a complementary HTML file containing the reference metadata. White space has been added for clarity. File = CSV_QS601EW2011WARDH_151277.zip

Example 1
"Economic activity"

               ,                 ,                                   "Count",                            "Count",                                   "Count",                                   "Count",                                                       "Count",                                                       "Count",                                                          "Count",                                                          "Count",                          "Count",                                 "Count",                              "Count",                         "Count",                                                        "Count",                                              "Count",                                            "Count",                       "Count"
               ,                 ,                                  "Person",                           "Person",                                  "Person",                                  "Person",                                                      "Person",                                                      "Person",                                                         "Person",                                                         "Person",                         "Person",                                "Person",                             "Person",                        "Person",                                                       "Person",                                             "Person",                                           "Person",                      "Person"
               ,                 ,               "Economic activity (T016A)",        "Economic activity (T016A)",               "Economic activity (T016A)",               "Economic activity (T016A)",                                   "Economic activity (T016A)",                                   "Economic activity (T016A)",                                      "Economic activity (T016A)",                                      "Economic activity (T016A)",      "Economic activity (T016A)",             "Economic activity (T016A)",          "Economic activity (T016A)",     "Economic activity (T016A)",                                    "Economic activity (T016A)",                          "Economic activity (T016A)",                        "Economic activity (T016A)",   "Economic activity (T016A)"
"Geographic ID","Geographic Area","Total: All categories: Economic activity","Total: Economically active: Total","Economically active: Employee: Part-time","Economically active: Employee: Full-time","Economically active: Self-employed with employees: Part-time","Economically active: Self-employed with employees: Full-time","Economically active: Self-employed without employees: Part-time","Economically active: Self-employed without employees: Full-time","Economically active: Unemployed","Economically active: Full-time student","Total: Economically inactive: Total","Economically inactive: Retired","Economically inactive: Student (including full-time students)","Economically inactive: Looking after home or family","Economically inactive: Long-term sick or disabled","Economically inactive: Other"
    "E92000001",        "England",                                "38881374",                         "27183134",                                 "5333268",                                "15016564",                                                      "148074",                                                      "715271",                                                         "990573",                                                        "1939714",                        "1702847",                               "1336823",                           "11698240",                       "5320691",                                                      "2255831",                                            "1695134",                                          "1574134",                      "852450"
    "W92000004",          "Wales",                                 "2245166",                          "1476735",                                  "313022",                                  "799348",                                                        "7564",                                                       "42107",                                                          "43250",                                                         "101108",                          "96689",                                 "73647",                             "768431",                        "361501",                                                       "133880",                                              "86396",                                           "140760",                       "45894"

Key characteristics of the CSV file are:

Requires: MultipleHeadingRows and AnnotationAndSupplementaryInfo.
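
Without metadata, a consumer has to know in advance how many title and heading rows precede the data. The following Python sketch makes that knowledge explicit as two parameters, `title_rows` and `header_rows`, which are assumptions of the example rather than anything the file declares; the sample is a three-column reduction of the extract above.

```python
import csv
import io

def read_multi_header(csv_text, title_rows=1, header_rows=4):
    """Split a CSV into title lines, merged column headings and data rows."""
    rows = [
        [cell.strip() for cell in row]
        for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True)
        if row  # drop blank lines
    ]
    titles = [row[0] for row in rows[:title_rows]]
    head = rows[title_rows:title_rows + header_rows]
    # Join the non-empty heading cells found in each column, top to bottom.
    columns = [" | ".join(c for c in col if c) for col in zip(*head)]
    data = [dict(zip(columns, row)) for row in rows[title_rows + header_rows:]]
    return titles, columns, data

SAMPLE = '''"Economic activity"

     ,       , "Count"
     ,       , "Person"
     ,       , "Economic activity (T016A)"
"Geographic ID","Geographic Area","Total: All categories: Economic activity"
"E92000001", "England", "38881374"
"W92000004", "Wales", "2245166"
'''
titles, columns, data = read_multi_header(SAMPLE)
```

Joining the heading cells with ` | ` is an arbitrary choice; it simply keeps the statistical unit, the measure and the category visible in one composite column name.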

Correct interpretation of the statistics requires additional qualification or awareness of context. To achieve this, the complementary HTML file includes supplementary information and annotations pertinent to the data published in the accompanying CSV file. Annotations or references may be applied to:

Requires: AnnotationAndSupplementaryInfo.

Furthermore, these statistical data sets make frequent use of predefined category codes and geographic regions. Dataset QS601EW Economic activity includes two examples:

At present there is no standardised mechanism to associate the category codes, provided as plain text, with their authoritative definitions.

Requires: AssociationOfCodeValuesWithExternalDefinitions.

Finally, reuse of the statistical data is also inhibited by a lack of explicit definition of the meaning of column headings.

Requires: SemanticTypeDefinition.

2.3 Use Case #3 - Creation of consolidated global land surface temperature climate databank

(Contributed by Jeremy Tandy)

Climate change and global warming have become one of the most pressing environmental concerns in society today. Crucial to predicting future change is an understanding of the world’s historical climate, with long duration instrumental records of climate being central to that goal. Whilst there is an abundance of data recording the climate at locations the world over, the scrutiny under which climate science is put means that much of this data remains unused, leading to a paucity of data in some regions with which to verify our understanding of climate change.

The International Surface Temperature Initiative seeks to create a consolidated global land surface temperatures databank as an open and freely available resource to climate scientists.

To achieve this goal, climate datasets, known as “decks”, are gathered from participating organisations and merged into a combined dataset using a scientifically peer reviewed method which assesses the data records for inclusion against a variety of criteria.

Given the need for openness and transparency in creating the databank, it is essential that the provenance of the source data is clear. Original source data, particularly for records captured prior to the mid-twentieth century, may be in hard-copy form. In order to incorporate the widest possible scope of source data, the International Surface Temperature Initiative is supported by data rescue activities to digitise hard copy records.

The data is, where possible, published in the following four stages:

The Stage 1 data is typically provided in tabular form - the most common variant is white-space delimited ASCII files. Each data deck comprises multiple files which are packaged as a compressed tar ball (.tar.gz). Included within the compressed tar ball package, and provided alongside, is a read-me file providing unstructured supplementary information. Summary information is often embedded at the top of each file.

For example, see the Ugandan Stage 1 data deck (local copy) and associated readme file (local copy).

The Ugandan Stage 1 data deck appears to comprise two discrete datasets, each partitioned into a sub-directory within the tar ball: uganda-raw and uganda-bestguess. Each sub-directory includes a Microsoft Word document providing supplementary information about the provenance of the dataset; of particular note is that uganda-raw is collated from 9 source datasets whilst uganda-bestguess provides what is considered by the data publisher to be the best set of values with duplicate values discarded.

Requires: AnnotationAndSupplementaryInfo.

Dataset uganda-raw is split into 96 discrete files, each providing maximum, minimum or mean monthly air temperature for one of the 32 weather observation stations (sites) included in the data set. Similarly, dataset uganda-bestguess is partitioned into discrete files; in this case just 3 files, each of which provides maximum, minimum or mean monthly air temperature data for all sites. The mapping from data file to data sub-set is described in the Microsoft Word document.

Requires: CsvAsSubsetOfLargerDataset.

A snippet of the data indicating maximum monthly temperature for Entebbe, Uganda, from uganda-raw is provided below. File = 637050_ENTEBBE_tmx.txt

Example 2
637050  ENTEBBE
ENTEBBE BEA     0.05    32.45   3761F
ENTEBBE GHCNv3G 0.05    32.45   1155M
ENTEBBE ColArchive      0.05    32.45   1155M
ENTEBBE GSOD    0.05    32.45   1155M
ENTEBBE NCARds512       0.05    32.755  1155M

1935.04	27.83	27.80	27.80	-999.00	-999.00
1935.12	25.72	25.70	25.70	-999.00	-999.00
1935.21	26.44	26.40	26.40	-999.00	-999.00
1935.29	25.72	25.70	25.70	-999.00	-999.00
1935.37	24.61	24.60	24.60	-999.00	-999.00
1935.46	24.33	24.30	24.30	-999.00	-999.00
1935.54	24.89	24.90	24.90	-999.00	-999.00

The key characteristics are:

A snippet of the data indicating maximum monthly temperature for all stations in Uganda from uganda-bestguess is provided below (truncated to 9 columns). File = ug_tmx_jrc_bg_v1.0.txt

Example 3
1935.04	-99.00	-99.00	-99.00	-99.00	-99.00	27.83	-99.00	-99.00	[…]
1935.12	-99.00	-99.00	-99.00	-99.00	-99.00	25.72	-99.00	-99.00	[…]
1935.21	-99.00	-99.00	-99.00	-99.00	-99.00	26.44	-99.00	-99.00	[…]
1935.29	-99.00	-99.00	-99.00	-99.00	-99.00	25.72	-99.00	-99.00	[…]
1935.37	-99.00	-99.00	-99.00	-99.00	-99.00	24.61	-99.00	-99.00	[…]
1935.46	-99.00	-99.00	-99.00	-99.00	-99.00	24.33	-99.00	-99.00	[…]
1935.54	-99.00	-99.00	-99.00	-99.00	-99.00	24.89	-99.00	-99.00	[…]

Many of the characteristics concerning the “raw” file are exhibited here too. Additionally, we see that:

At present, the global surface temperature databank comprises 25 Stage 1 data decks for monthly temperature observations. These are provided by numerous organisations in heterogeneous forms. In order to merge these data decks into a single combined dataset, each data deck has to be converted into a standard form. Columns consist of: station name, latitude, longitude, altitude, date, maximum monthly temperature, minimum monthly temperature, mean monthly temperature plus additional provenance information.

An example Stage 2 data file is given for Entebbe, Uganda, below. File = uganda_000000000005_monthly_stage2

Example 4
ENTEBBE                            0.0500    32.4500  1146.35 193501XX  2783  1711  2247 301/109/101/104/999/999/999/000/000/000/102
ENTEBBE                            0.0500    32.4500  1146.35 193502XX  2572  1772  2172 301/109/101/104/999/999/999/000/000/000/102
ENTEBBE                            0.0500    32.4500  1146.35 193503XX  2644  1889  2267 301/109/101/104/999/999/999/000/000/000/102
ENTEBBE                            0.0500    32.4500  1146.35 193504XX  2572  1817  2194 301/109/101/104/999/999/999/000/000/000/102
ENTEBBE                            0.0500    32.4500  1146.35 193505XX  2461  1722  2092 301/109/101/104/999/999/999/000/000/000/102
ENTEBBE                            0.0500    32.4500  1146.35 193506XX  2433  1706  2069 301/109/101/104/999/999/999/000/000/000/102
ENTEBBE                            0.0500    32.4500  1146.35 193507XX  2489  1628  2058 301/109/101/104/999/999/999/000/000/000/102

Because of the heterogeneity of the Stage 1 data decks, bespoke data processing programs were required for each data deck, consuming valuable effort and resource in simple data pre-processing. If the semantics, structure and other supplementary metadata pertinent to the Stage 1 data decks had been machine readable, then this data homogenisation stage could have been avoided altogether. Data provenance is crucial to this initiative; therefore it would be beneficial to be able to associate the supplementary metadata without needing to edit the original data files.

Requires: AssociationOfCodeValuesWithExternalDefinitions, SyntacticTypeDefinition, SemanticTypeDefinition, MissingValueDefinition, NonStandardCellDelimiter and ZeroEditAdditionOfSupplementaryMetadata.
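
The bespoke pre-processing described above typically amounts to code such as the following Python sketch, in which the delimiter and the missing-value tokens are hard-wired by the programmer. The tokens `-999.00` and `-99.00` are read off the two Ugandan examples; treating the first column as a decimal-year date is an assumption based on how the values look, not on documentation.

```python
import re

def parse_stage1(text, missing=("-999.00", "-99.00")):
    """Parse white-space delimited rows, mapping missing-value tokens to None."""
    rows = []
    for line in text.splitlines():
        fields = re.split(r"\s+", line.strip())
        # Only lines that start with a date such as 1935.04 are treated as
        # data; the summary lines embedded at the top of the file are skipped.
        if not re.fullmatch(r"\d{4}\.\d{2}", fields[0]):
            continue
        date, *values = fields
        rows.append((date, [None if v in missing else float(v) for v in values]))
    return rows

SAMPLE = """637050  ENTEBBE
ENTEBBE BEA     0.05    32.45   3761F

1935.04\t27.83\t27.80\t27.80\t-999.00\t-999.00
1935.12\t25.72\t25.70\t25.70\t-999.00\t-999.00
"""
rows = parse_stage1(SAMPLE)
```

Every constant in this function (the token list, the date pattern, the delimiter) is exactly the kind of information that machine-readable metadata could supply without editing the data file.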

The data pre-processing tools created to parse each Stage 1 data deck into the standard Stage 2 format and the merge process to create the consolidated Stage 3 data set were written using the software most familiar to the participating scientists: Fortran 95. The merge software source code is available online. It is worth noting that this sector of the scientific community also commonly uses IDL and is gradually adopting Python as the default software language choice.

The resulting merged dataset is published in several formats, including tabular text. The GHCN-format merged dataset (available from the US National Climatic Data Center's FTP site) comprises several files: merged data and withheld data (e.g. those data that did not meet the merge criteria), each with an associated “inventory” file.

A snippet of the inventory for merged data is provided below; each row describes one of the 31,427 sites in the dataset. File = merged.monthly.stage3.v1.0.0-beta4.inv

Example 5
REC41011874   0.0500  32.4500 1155.0 ENTEBBE_AIRPO

The columns are: station identifier, latitude, longitude, altitude (m) and station name. The data is fixed format rather than delimited.

Similarly, a snippet of the merged data itself is provided. Given that the original .dat file is a largely unmanageable 422.6 MB in size, a subset is provided. File = merged.monthly.stage3.v1.0.0-beta4.snip

Example 6
REC410118741935TAVG 2245    2170    2265    2195    2090    2070    2059    2080    2145    2190    2225    2165
REC410118741935TMAX 2780    2570    2640    2570    2460    2430    2490    2520    2620    2630    2660    2590
REC410118741935TMIN 1710    1770    1890    1820    1720    1710    1629    1640    1670    1750    1790    1740

The columns are: station identifier, year, quantity kind and the quantity values for months January to December in that year. Again, the data is fixed format rather than delimited.

Here we see the station identifier REC41011874 being used as a foreign key to refer to the observing station details; in this case Entebbe Airport. Once again, there is no metadata provided within the file to describe how to interpret each of the data values.

Requires: ForeignKeyReferences.
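
As an illustration of the foreign key relationship, the Python sketch below reads one inventory row and one merged-data row and resolves the station identifier. The character offsets used for the key fields (11 for the identifier, 4 for the year, 4 for the quantity kind) are inferred from the snippets above, not from a format specification, and the monthly values are split on white space because their exact column widths are not stated.

```python
def parse_inventory(line):
    """Split one inventory row into (station id, site details)."""
    station, lat, lon, alt, name = line.split()
    return station, {"lat": float(lat), "lon": float(lon),
                     "alt_m": float(alt), "name": name}

def parse_merged(line):
    """Slice the leading key fields, then read the monthly values."""
    # Offsets inferred from the example rows (an assumption).
    station, year, kind = line[:11], int(line[11:15]), line[15:19]
    values = [int(v) for v in line[19:].split()]
    return station, year, kind, values

inventory = dict([parse_inventory(
    "REC41011874   0.0500  32.4500 1155.0 ENTEBBE_AIRPO")])
station, year, kind, values = parse_merged(
    "REC410118741935TMAX 2780    2570    2640    2570    2460    2430"
    "    2490    2520    2620    2630    2660    2590"
)
# Resolve the foreign key: look the station identifier up in the inventory.
site = inventory[station]
```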

The resulting merged dataset provides time series of how the observed climate has changed over a long duration at approximately 32000 locations around the globe. Such instrumental climate records provide a basis for climate research. However, it is well known that these climate records are usually affected by inhomogeneities (artificial shifts) due to changes in the measurement conditions (e.g. relocation, modification or recalibration of the instrument etc.). As these artificial shifts often have the same magnitude as the climate signal, such as long-term variations, trends or cycles, a direct analysis of the raw time-series data can lead to wrong conclusions about climate change.

Statistical homogenisation procedures are used to detect and correct these artificial shifts. Once detected, the raw time-series data is annotated to indicate the presence of artificial shifts in the data, details of the homogenisation procedure undertaken and, where possible, the reasons for those shifts.

Requires: AnnotationAndSupplementaryInfo.

Future iterations of the global land surface temperatures databank are anticipated to include quality controlled (Stage 4) and homogenised (Stage 5) datasets derived from the merged dataset (Stage 3) outlined above.

2.4 Use Case #4 - Publication of public sector roles and salaries

(Contributed by Jeni Tennison)

In line with the G8 open data charter Principle 4: Releasing data for improved governance, the UK Government publishes information about public sector roles and salaries.

The collection of this information is managed by the Cabinet Office and subsequently published via the UK Government data portal at data.gov.uk.

In order to ensure a consistent return from submitting departments and agencies, the Cabinet Office mandated that each response conform to a data definition schema, which is described within a narrative PDF document. Each submission comprises a pair of CSV files - one for senior roles and another for junior roles.

Requires: GroupingOfMultipleTables, WellFormedCsvCheck and CsvValidation.

The submission for senior roles from the Higher Education Funding Council for England (HEFCE) is provided below to illustrate. White space has been added for clarity. File = HEFCE_organogram_senior_data_31032011.csv

Example 7
Post Unique Reference,              Name,Grade,             Job Title,                Job/Team Function,                            Parent Department,                                Organisation,                             Unit,     Contact Phone,         Contact E-mail,Reports to Senior Post,Salary Cost of Reports (£),FTE,Actual Pay Floor (£),Actual Pay Ceiling (£),,Profession,Notes,Valid?
                90115,        Steve Egan,SCS1A,Deputy Chief Executive,  Finance and Corporate Resources,Department for Business Innovation and Skills,Higher Education Funding Council for England,  Finance and Corporate Resources,     0117 931 7408,     s.egan@hefce.ac.uk,                 90334,                   5883433,  1,              120000,                124999,,   Finance,     ,     1
                90250,     David Sweeney,SCS1A,              Director,"Research, Innovation and Skills",Department for Business Innovation and Skills,Higher Education Funding Council for England,"Research, Innovation and Skills",     0117 931 7304, d.sweeeney@hefce.ac.uk,                 90334,                   1207171,  1,              110000,                114999,,    Policy,     ,     1
                90284,       Heather Fry,SCS1A,              Director,      Education and Participation,Department for Business Innovation and Skills,Higher Education Funding Council for England,      Education and Participation,     0117 931 7280,      h.fry@hefce.ac.uk,                 90334,                   1645195,  1,              100000,                104999,,    Policy,     ,     1
                90334,Sir Alan Langlands, SCS4,       Chief Executive,                  Chief Executive,Department for Business Innovation and Skills,Higher Education Funding Council for England,                            HEFCE,0117 931 7300/7341,a.langlands@hefce.ac.uk,                    xx,                         0,  1,              230000,                234999,,    Policy,     ,     1

Similarly, a snippet of the junior role submission from HEFCE is provided. Again, white space has been added for clarity. File = HEFCE_organogram_junior_data_31032011.csv

Example 8
                            Parent Department,                                Organisation,                           Unit,Reporting Senior Post,Grade,Payscale Minimum (£),Payscale Maximum (£),Generic Job Title,Number of Posts in FTE,          Profession
Department for Business Innovation and Skills,Higher Education Funding Council for England,    Education and Participation,                90284,    4,               17426,               20002,    Administrator,                     2,Operational Delivery
Department for Business Innovation and Skills,Higher Education Funding Council for England,    Education and Participation,                90284,    5,               19546,               22478,    Administrator,                     1,Operational Delivery
Department for Business Innovation and Skills,Higher Education Funding Council for England,Finance and Corporate Resources,                90115,    4,               17426,               20002,    Administrator,                  8.67,Operational Delivery
Department for Business Innovation and Skills,Higher Education Funding Council for England,Finance and Corporate Resources,                90115,    5,               19546,               22478,    Administrator,                   0.5,Operational Delivery

Key characteristics of the CSV files are:

Within the senior role CSV the cell Post Unique Reference provides a primary key within the data file for each row. In addition, it provides a unique identifier for the entity described within a given row. In order for the entity to be referenced from outside this datafile, the local identifier must be mapped to a globally unique identifier such as a URI.

Requires: PrimaryKey and URIMapping.

This unique identifier is referenced both from within the senior post dataset, Reports to Senior Post, and within the junior post dataset, Reporting Senior Post, in order to determine the relationships within the organisational structure.

Requires: ForeignKeyReferences.

For the most senior role in a given organisation, the Reports to Senior Post cell is expressed as xx, denoting that this post does not report to anyone within the organisation.

Requires: MissingValueDefinition.
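
The three characteristics above (primary key, foreign key reference and missing value token) can be illustrated with a minimal Python sketch. It is not the software used by data.gov.uk; it works on a reduced extract of Example 7 that keeps only three of the original columns, and the function name load_reporting_lines is hypothetical.

```python
import csv
import io

# Reduced extract of Example 7: only three of the original columns are kept.
SENIOR_CSV = """Post Unique Reference,Name,Reports to Senior Post
90115,Steve Egan,90334
90250,David Sweeney,90334
90284,Heather Fry,90334
90334,Sir Alan Langlands,xx
"""

MISSING = "xx"  # token used in the source data for "reports to no one"


def load_reporting_lines(text):
    """Return {post_id: manager_id or None}, checking both key constraints."""
    reports_to = {}
    for row in csv.DictReader(io.StringIO(text)):
        post = row["Post Unique Reference"].strip()
        if post in reports_to:
            raise ValueError(f"duplicate primary key: {post}")
        manager = row["Reports to Senior Post"].strip()
        # Interpret the missing value token as "no value" rather than as an id.
        reports_to[post] = None if manager == MISSING else manager
    # Foreign key check: every referenced post must itself be a row.
    for post, manager in reports_to.items():
        if manager is not None and manager not in reports_to:
            raise ValueError(f"{post} reports to unknown post {manager}")
    return reports_to


reporting = load_reporting_lines(SENIOR_CSV)
heads = [post for post, manager in reporting.items() if manager is None]
print(heads)
```

Without a declared missing value token, a generic parser would have to treat xx as a dangling reference to a post that does not exist.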

The public sector roles and salaries information is published at data.gov.uk using an interactive "Organogram Viewer" widget implemented using JavaScript. The HEFCE data can be visualized here. For convenience, a screenshot is provided in Fig. 1 Screenshot of Organogram Viewer web application showing HEFCE data.

data.gov.uk-roles-and-salaries-browser.png Fig. 1 Screenshot of Organogram Viewer web application showing HEFCE data

In order to create this visualization, each pair of tabular datasets was transformed into RDF and uploaded into a triple store exposing a SPARQL end-point, which the interactive widget then queries to acquire the necessary data. An example of the derived RDF is provided in file HEFCE_organogram_31032011.rdf.

The transformation from CSV to RDF required bespoke software, supplementing the content in the CSV files with additional information such as the RDF properties for each column. The need to create and maintain bespoke software incurs costs that may be avoided through use of a generic CSV-to-RDF transformation mechanism.

Requires: CsvToRdfTransformation.

2.5 Use Case #5 - Publication of property transaction data

(Contributed by Andy Seaborne)

The Land Registry is the government department with responsibility to register the ownership of land and property within England and Wales. Once land or property is entered in the Land Register any ownership changes, mortgages or leases affecting that land or property are recorded.

Their Price paid data, dating from 1995 and consisting of more than 18.5 million records, tracks the residential property sales in England and Wales that are lodged for registration. This dataset is one of the most reliable sources of house price information in England and Wales.

Residential property transaction details are extracted from a data warehouse system and collated into a tabular dataset for each month. The current monthly dataset is available online in both .txt and .csv formats. Snippets of data for January 2014 are provided below. White space has been added for clarity.

pp-monthly-update.txt (local copy)

Example 9
{C6428808-DC2A-4CE7-8576-0000303EF81B},137000,2013-12-13 00:00, "B67 5HE","T","N","F","130","",       "WIGORN ROAD",       "",   "SMETHWICK",            "SANDWELL",       "WEST MIDLANDS","A"
{16748E59-A596-48A0-B034-00007533B0C1}, 99950,2014-01-03 00:00, "PE3 8QR","T","N","F", "11","",             "RISBY","BRETTON","PETERBOROUGH","CITY OF PETERBOROUGH","CITY OF PETERBOROUGH","A"
{F10C5B50-92DD-4A69-B7F1-0000C3899733},355000,2013-12-19 00:00,"BH24 1SW","D","N","F", "55","","NORTH POULNER ROAD",       "",    "RINGWOOD",          "NEW FOREST",           "HAMPSHIRE","A"

pp-monthly-update-new-version.csv (local copy)

Example 10
"{C6428808-DC2A-4CE7-8576-0000303EF81B}","137000","2013-12-13 00:00", "B67 5HE","T","N","F","130","",       "WIGORN ROAD",       "",   "SMETHWICK",            "SANDWELL",       "WEST MIDLANDS","A"
"{16748E59-A596-48A0-B034-00007533B0C1}", "99950","2014-01-03 00:00", "PE3 8QR","T","N","F", "11","",             "RISBY","BRETTON","PETERBOROUGH","CITY OF PETERBOROUGH","CITY OF PETERBOROUGH","A"
"{F10C5B50-92DD-4A69-B7F1-0000C3899733}","355000","2013-12-19 00:00","BH24 1SW","D","N","F", "55","","NORTH POULNER ROAD",       "",    "RINGWOOD",          "NEW FOREST",           "HAMPSHIRE","A"

There seems to be little difference between the two formats with the exception that every cell within the .csv file is enclosed in a pair of double quotes, whereas the .txt file leaves the first three cells of each row unquoted.
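
As a hedged illustration of this observation, the sketch below parses the first row of Example 9 and of Example 10 with Python's standard csv module and compares the results. The white space that was added for clarity has been removed from both rows; the variable names are illustrative only.

```python
import csv

# First row of Example 9 (.txt) and Example 10 (.csv), without the padding
# that was added for clarity in the examples.
TXT_LINE = ('{C6428808-DC2A-4CE7-8576-0000303EF81B},137000,2013-12-13 00:00,'
            '"B67 5HE","T","N","F","130","","WIGORN ROAD","","SMETHWICK",'
            '"SANDWELL","WEST MIDLANDS","A"')
CSV_LINE = ('"{C6428808-DC2A-4CE7-8576-0000303EF81B}","137000",'
            '"2013-12-13 00:00","B67 5HE","T","N","F","130","","WIGORN ROAD",'
            '"","SMETHWICK","SANDWELL","WEST MIDLANDS","A"')

txt_row = next(csv.reader([TXT_LINE]))
csv_row = next(csv.reader([CSV_LINE]))

# Quoting is purely syntactic: both rows yield the same 15 cell values.
same = txt_row == csv_row
print(len(txt_row), same)
```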

The header row is absent. Information regarding the meaning of each column and the abbreviations used within the dataset is provided in a complementary FAQ document. The column headings are provided below along with some supplemental detail:

  1. Transaction unique identifier
  2. Price - sale price stated on the Transfer deed
  3. Date of Transfer - date when the sale was completed, as stated on the Transfer deed
  4. Postcode
  5. Property Type - D (detached), S (semi-detached), T (terraced), F (flats/maisonettes)
  6. Old/New - Y (newly built property) and N (established residential building)
  7. Duration - relates to tenure; F (freehold) and L (leasehold)
  8. PAON - Primary Addressable Object Name
  9. SAON - Secondary Addressable Object Name
  10. Street
  11. Locality
  12. Town/City
  13. Local Authority
  14. County
  15. Record status - indicates status of the transaction; A (addition of a new transaction), C (correction of an existing transaction) and D (deleted transaction)

Requires: AnnotationAndSupplementaryInfo.

Each row, or record, within the tabular dataset describes a property transaction. The Transaction unique identifier column provides a unique identifier for that property transaction. Given that transactions may be amended, this identifier cannot be treated as a primary key for rows within the dataset as the identifier may occur more than once. In order for the property transaction to be referenced from outside this dataset, the local identifier must be mapped to a globally unique identifier such as a URI.

Requires: URIMapping.

Each transaction record makes use of predefined category codes as outlined above; e.g. Duration may be F (freehold) or L (leasehold). Furthermore, geographic descriptors are commonly used. Whilst there is no attempt to link these descriptors to specific geographic identifiers, such a linkage is likely to provide additional utility when aggregating transaction data by location or region for further analysis. At present there is no standardised mechanism to associate the category codes, provided as plain text, or geographic identifiers with their authoritative definitions.

Requires: AssociationOfCodeValuesWithExternalDefinitions.

The collated monthly transaction dataset is used as the basis for updating the Land Registry's information systems; in this case the data is persisted as RDF triples within a triple store. A SPARQL end-point and accompanying data definitions are provided by the Land Registry allowing users to query the content of the triple store.

In order to update the triple store, the monthly transaction dataset is converted into RDF. The value of the Record status cell for a given row informs the update process: add, update or delete. Bespoke software has been created by the Land Registry to transform the data from CSV to RDF. The transformation requires supplementary information not present in the CSV, such as the RDF properties for each column specified in the data definitions. The need to create and maintain bespoke software incurs costs that may be avoided through use of a generic CSV-to-RDF transformation mechanism.

Requires: CsvToRdfTransformation.
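
The way the Record status cell could drive an update can be sketched as follows. This is an illustration only and not the Land Registry's software: a plain dictionary stands in for the triple store, the function name apply_monthly_update is hypothetical, and the correction and deletion rows are invented to exercise the C and D codes. Treating a correction as an overwrite is also an assumption.

```python
def apply_monthly_update(store, rows):
    """Apply (transaction_id, price, record_status) rows to a dict "store"."""
    for transaction_id, price, status in rows:
        if status in ("A", "C"):
            # Addition or correction: assumed here to (over)write the record.
            store[transaction_id] = price
        elif status == "D":
            # Deletion: remove the record if it is present.
            store.pop(transaction_id, None)
        else:
            raise ValueError(f"unknown record status: {status!r}")
    return store


rows = [
    ("{C6428808-DC2A-4CE7-8576-0000303EF81B}", 137000, "A"),  # from Example 9
    ("{16748E59-A596-48A0-B034-00007533B0C1}", 99950, "A"),   # from Example 9
    ("{C6428808-DC2A-4CE7-8576-0000303EF81B}", 138000, "C"),  # hypothetical
    ("{16748E59-A596-48A0-B034-00007533B0C1}", 99950, "D"),   # hypothetical
]
store = apply_monthly_update({}, rows)
print(store)
```

The same transaction identifier appears in more than one row, which is why it cannot serve as a primary key for rows of the dataset.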


The monthly transaction dataset contains in the order of 100,000 records; any transformation will need to scale accordingly.

In parallel to providing access via the SPARQL end-point, the Land Registry also provides aggregated sets of transaction data. Data is available as a single file containing all transactions since 1995, or partitioned by year. Given that the complete dataset is approaching 3GB in size, the annual partitions provide a far more manageable method to download the property transaction data. However, each annual partition is only a subset of the complete dataset. It is important to be able to both make assertions about the complete dataset (e.g. publication date, license etc.) and to be able to understand how an annual partition relates to the complete dataset and other partitions.

Requires: CsvAsSubsetOfLargerDataset.

2.6 Use Case #6 - Journal Article Solr Search Results

(Contributed by Alf Eaton)

When performing literature searches researchers need to retain a persisted collection of journal articles of interest in a local database compiled from online publication websites. In this use case a researcher wants to retain a local personal journal article publication database based on the search results from Public Library of Science. PLOS One is a nonprofit open access scientific publishing project aimed at creating a library of open access journals and other scientific literature under an open content license.

In general this use case also illustrates the utility of CSV as a convenient exchange format for pushing tabular data between software components:

The PLOS website features a Solr index search engine (Live Search) which can return query results in XML, JSON or in a more concise CSV format. The output from the CSV Live Search is illustrated below:

Example 11
10.1371/journal.pone.0095131,10.1371/journal.pone.0095131,2014-06-05T00:00:00Z,"Genotyping of French <i>Bacillus anthracis</i> Strains Based on 31-Loci Multi Locus VNTR Analysis: Epidemiology, Marker Evaluation, and Update of the Internet Genotype Database","Simon Thierry,Christophe Tourterel,Philippe Le Flèche,Sylviane Derzelle,Neira Dekhil,Christiane Mendy,Cécile Colaneri,Gilles Vergnaud,Nora Madani"
10.1371/journal.pone.0095156,10.1371/journal.pone.0095156,2014-06-05T00:00:00Z,Pathways Mediating the Interaction between Endothelial Progenitor Cells (EPCs) and Platelets,"Oshrat Raz,Dorit L Lev,Alexander Battler,Eli I Lev"
10.1371/journal.pone.0095275,10.1371/journal.pone.0095275,2014-06-05T00:00:00Z,Identification of Divergent Protein Domains by Combining HMM-HMM Comparisons and Co-Occurrence Detection,"Amel Ghouila,Isabelle Florent,Fatma Zahra Guerfali,Nicolas Terrapon,Dhafer Laouini,Sadok Ben Yahia,Olivier Gascuel,Laurent Bréhélin"
10.1371/journal.pone.0096098,10.1371/journal.pone.0096098,2014-06-05T00:00:00Z,Baseline CD4 Cell Counts of Newly Diagnosed HIV Cases in China: 2006–2012,"Houlin Tang,Yurong Mao,Cynthia X Shi,Jing Han,Liyan Wang,Juan Xu,Qianqian Qin,Roger Detels,Zunyou Wu"
10.1371/journal.pone.0097475,10.1371/journal.pone.0097475,2014-06-05T00:00:00Z,Crystal Structure of the Open State of the <i>Neisseria gonorrhoeae</i> MtrE Outer Membrane Channel,"Hsiang-Ting Lei,Tsung-Han Chou,Chih-Chia Su,Jani Reddy Bolla,Nitin Kumar,Abhijith Radhakrishnan,Feng Long,Jared A Delmar,Sylvia V Do,Kanagalaghatta R Rajashankar,William M Shafer,Edward W Yu"

Versions of the search results provided at time of writing are available locally in XML, JSON and CSV formats for reference.

A significant difference between the CSV formatted results and those of JSON and XML is the absence of information about how the set of results provided in the HTTP response fits within the complete set of results that match the Live Search request. The information provided in the JSON and XML search results states both the total number of "hits" for the Live Search request and the start index within the complete set (zero for the example provided here as the ?start={offset} query parameter is absent from the request).

Other common methods of splitting up large datasets into manageable chunks include partitioning by time (e.g. all the records added to a dataset in a given day may be exported in a CSV file). Such partitioning allows regular updates to be shared. However, in order to recombine those time-based partitions into the complete set, one needs to know the datetime range for which that dataset partition is valid. Such information should be available within a CSV metadata description.

Requires: CsvAsSubsetOfLargerDataset.

To be useful to a user maintaining a local database, PLOS One search results need to be returned in an organized and consistent tabular format. This includes:

Lastly, because the researcher may use different search criteria, the header row plays an important role later for the researcher wanting to combine multiple literature searches into their database. The researcher will use the column names returned in the header row as a way to identify each column type.

Requires: WellFormedCsvCheck and CsvValidation.

Search results returned in a tabular format can contain cell values that are organized in data structures, also known as microformats. In the example above the publication_date and authors list represent two microformats that are represented in a recognizable pattern that can be parsed by software or by the human reader. In the case of the author column, microformats provide the advantage of being able to store a single author's name or multiple authors' names separated by a comma delimiter. Because each author cell value is surrounded by quotes, a parser can choose to ignore the data structure or address it.

Furthermore, note that the values of the title_display column contain markup. Whilst these values may be treated as pure text, they provide an example of how structure or syntax may be embedded within a cell.

Requires: CellMicrosyntax and RepeatedProperties.
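
The two microformats described above can be unpacked with a short Python sketch. It uses the second row of Example 11 and assumes, from the pattern visible in the data only, that the date cell is an ISO 8601 timestamp in UTC and that author names never contain a comma. The variable names are illustrative.

```python
import csv
from datetime import datetime, timezone

# Second row of Example 11.
LINE = ('10.1371/journal.pone.0095156,10.1371/journal.pone.0095156,'
        '2014-06-05T00:00:00Z,'
        'Pathways Mediating the Interaction between Endothelial Progenitor '
        'Cells (EPCs) and Platelets,'
        '"Oshrat Raz,Dorit L Lev,Alexander Battler,Eli I Lev"')

first_id, second_id, date_cell, title, authors_cell = next(csv.reader([LINE]))

# Microsyntax 1: the date cell follows an ISO 8601 pattern (assumed UTC, "Z").
published = datetime.strptime(date_cell, "%Y-%m-%dT%H:%M:%SZ").replace(
    tzinfo=timezone.utc)

# Microsyntax 2: the author cell is itself a comma separated list, so one
# cell carries repeated values of the same property.
authors = authors_cell.split(",")

print(published.date(), authors)
```

A parser that is unaware of the microsyntax still reads the author cell correctly as a single string, because the quotes protect the embedded commas.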

2.7 Use Case #7 - Reliability Analyses of Police Open Data

(Contributed by Davide Ceolin)

Several Web sources expose datasets about UK crime statistics. These datasets vary in format (e.g. maps vs. CSV files), timeliness, aggregation level, etc. Before being published on the Web, these data are processed to preserve the privacy of the people involved, but again the processing policy varies from source to source.

Every month, the UK Police Home Office publishes (via data.police.uk) CSV files that report crime counts, aggregated on a geographical basis (per address or police neighbourhood) and by crime type. Before publishing, data are smoothed, that is, grouped in predefined areas and assigned to the mid point of each area. Each area has to contain a minimum number of physical addresses. The goal of this procedure is to prevent the reconstruction of the identity of the people involved in the crimes.

Over time, the policies adopted for preprocessing these data have changed, but data previously published have not been recomputed. Therefore, datasets about different months present relevant differences in terms of crime types reported and geographical aggregation (e.g. initially, each geographical area for aggregation had to include at least 12 physical addresses; later, this limit was lowered to 8).

These policies introduce a controlled error in the data for privacy reasons, but the changes in the policies mean that different datasets adhere differently to the real data, i.e. they present different reliability levels. Previous work provided two procedures for measuring and comparing the reliability of the datasets, but in order to automate and improve these procedures, it is crucial to understand the meaning of the columns, the relationships between columns, and how the data rows have been computed.

For instance, here is a snippet from a dataset about crimes that happened in Hampshire in April 2011:

Example 12
Month,	Force,			Neighbourhood,	Burglary,	Robbery,	Vehicle crime,	Violent crime,	Anti-social behaviour,	Other crime
2011-04,	Hampshire Constabulary,	2LE11,		2,		0,		1,		6,		14,			6
2011-04,	Hampshire Constabulary,	2LE10,		1,		0,		2,		4,		15,			6
2011-04,	Hampshire Constabulary,	2LE12,		3,		0,		0,		4,		25,			21

That dataset reports 248 entries. By October 2012, the number of crime types reported had increased to 11:

Example 13
Month,	Force,			Neighbourhood,	Burglary,	Robbery,	Vehicle crime,	Violent crime,	Anti-social behaviour,	Criminal damage and arson,	Shoplifting,	Other theft,	Drugs,	Public disorder and weapons,	Other crime
2012-10,Hampshire Constabulary,	2LE11,		1,		0,		1,		2,		8,			0,				0,		1,		1,	0,				1
2012-10,Hampshire Constabulary,	1SY01,		9,		1,		12,		8,		87,			17,				12,		14,		13,	7,				4
2012-10,Hampshire Constabulary,	1SY02,		11,		0,		11,		20,		144,			39,				2,		12,		9,	8,				5

This dataset reports 232 entries.

In order to properly handle the columns, it is crucial to understand the type of the data contained therein. Given the context, knowing this information would reveal an important part of the column meaning (e.g. to identify dates).

Requires: SyntacticTypeDefinition.

Also, it is important to understand the precise semantics of each column. This is relevant for two reasons. First, to identify relations between columns (e.g. some crime types are siblings, while others are less semantically related). Second, to identify semantic relations between columns in heterogeneous datasets (e.g. a column in one dataset may correspond to the sum of two or more columns in others).

Requires: SemanticTypeDefinition.
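
A first, purely syntactic step towards comparing such heterogeneous datasets is to compare their header rows. The sketch below does this for the header rows of Example 12 and Example 13, with the alignment white space removed. The choice of Month, Force and Neighbourhood as non-crime columns is an assumption made for this illustration, as is the helper name crime_types.

```python
import csv

# Header rows of Example 12 and Example 13 without the alignment white space.
HEADER_EXAMPLE_12 = ("Month,Force,Neighbourhood,Burglary,Robbery,Vehicle crime,"
                     "Violent crime,Anti-social behaviour,Other crime")
HEADER_EXAMPLE_13 = ("Month,Force,Neighbourhood,Burglary,Robbery,Vehicle crime,"
                     "Violent crime,Anti-social behaviour,"
                     "Criminal damage and arson,Shoplifting,Other theft,Drugs,"
                     "Public disorder and weapons,Other crime")

# Assumption: these three columns identify a row; all others are crime counts.
ID_COLUMNS = {"Month", "Force", "Neighbourhood"}


def crime_types(header_line):
    """Return the crime type column names found in a header row."""
    cells = next(csv.reader([header_line]))
    return [cell.strip() for cell in cells if cell.strip() not in ID_COLUMNS]


older = crime_types(HEADER_EXAMPLE_12)
newer = crime_types(HEADER_EXAMPLE_13)
added = [name for name in newer if name not in older]
print(len(older), len(newer), added)
```

The comparison only shows which column names differ. It cannot tell whether, say, the older Other crime column corresponds to the sum of several newer columns; that is the kind of semantic relation a column definition would need to state.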

Lastly, datasets with different row numbers are the result of different smoothing procedures. Therefore, it would be important to trace and access their provenance, in order to facilitate their comparison.

Requires: AnnotationAndSupplementaryInfo.

2.8 Use Case #8 - Analyzing Scientific Spreadsheets

(Contributed by Alf Eaton, Davide Ceolin, Martine de Vos)

A paper published in Nature Immunology in December 2012 compared changes in expression of a range of genes in response to treatment with two different cytokines. The results were published in the paper as graphic figures, and the raw data was presented in the form of supplementary spreadsheets, as Excel files (local copy).

Having both the paper and the results at their disposal, a scientist may wish to reproduce the experiment, check whether the results they obtain coincide with those published, and compare those results with others provided by different studies about the same issues.

Because of the size of the datasets and of the complexity of the computations, it could be necessary to perform such analyses and comparisons by means of properly defined software, typically by means of an R, Python or Matlab script. Such software would require as input the data contained in the Excel file. However, it would be difficult to write a parser to extract the information, for the reasons described below.

To clarify the issues related to the spreadsheet parsing and analysis, we first present an example extracted from it. The example below shows a CSV encoding of the original Excel spreadsheet converted using Microsoft Excel 2007. White space has been added to aid clarity. (file = ni.2449-S3.csv)

Example 14
Supplementary Table 2. Genes more potently regulated by IL-15,,,,,,,,,,,,,,,,,,
            ,         ,     ,       ,         ,        ,          ,       ,         ,        ,          ,           ,         ,        ,          ,       ,         ,        ,
   gene_name,   symbol, RPKM,       ,         ,        ,          ,       ,         ,        ,          ,Fold Change,         ,        ,          ,       ,         ,        ,
            ,         ,     , 4 hour,         ,        ,          ,24 hour,         ,        ,          ,     4 hour,         ,        ,          ,24 hour,         ,        ,
            ,         , Cont,IL2_1nM,IL2_500nM,IL15_1nM,IL15_500nM,IL2_1nM,IL2_500nM,IL15_1nM,IL15_500nM,    IL2_1nM,IL2_500nM,IL15_1nM,IL15_500nM,IL2_1nM,IL2_500nM,IL15_1nM,IL15_500nM
NM_001033122,     Cd69,15.67,  46.63,   216.01,   30.71,    445.58,   9.21,    77.32,    4.56,     77.21,       2.98,    13.78,    1.96,     28.44,   0.59,     4.93,    0.29,      4.93
   NM_026618,   Ccdc56, 9.07,  12.55,     9.25,    5.88,     14.33,  20.08,    20.91,   11.97,     22.69,       1.38,     1.02,    0.65,      1.58,   2.21,     2.31,    1.32,      2.50
   NM_008637,    Nudt1, 9.31,   7.51,     8.60,   11.21,      6.84,  15.85,    25.14,    7.56,     22.77,       0.81,     0.92,    1.20,      0.73,   1.70,     2.70,    0.81,      2.45
   NM_008638,   Mthfd2,58.67,  33.99,   245.87,   44.66,    167.87,  55.62,   204.50,   24.52,    176.51,       0.58,     4.19,    0.76,      2.86,   0.95,     3.49,    0.42,      3.01
   NM_178185,Hist1h2ao, 7.13,  16.52,     7.82,    7.79,     16.99,  75.04,   290.72,   21.99,    164.93,       2.32,     1.10,    1.09,      2.38,  10.52,    40.78,    3.08,     23.13

As we can see from the example, the table contains several columns of data that are measurements of gene expression in cells after treatment with two concentrations of two cytokines, measured after two periods of time, presented as both actual values and fold change. This can be represented in a table, but needs 3 levels of headings and several merged cells. In fact, the first row is the title of the table, the second row is empty, and the third to fifth rows are the table headers.

We also see that the first column gene_name provides a unique identifier for the gene described in each row, with the second column symbol providing a human readable notation for each gene - albeit a scientific human! It is necessary to determine which column, if any, provides the unique identifier for the entity which each row describes. In order for the gene to be referenced from outside the datafile, e.g. to reconcile the information in this table with other information about the gene, the local identifier must be mapped to a globally unique identifier such as a URI.

Requires: MultipleHeadingRows and URIMapping.
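
One possible way for a script to cope with the three heading rows of Example 14 is to carry each merged heading forward across the empty cells that follow it and then join the levels into a single name per column. The sketch below is an illustration under that assumption; the function names and the "/" separator are arbitrary choices, not part of any specification.

```python
# The three heading rows of Example 14, one list item per cell (19 columns).
ROW_GROUPS = ["gene_name", "symbol", "RPKM"] + [""] * 8 + ["Fold Change"] + [""] * 7
ROW_TIMES = [""] * 3 + ["4 hour", "", "", "", "24 hour", "", "", ""] * 2
ROW_TREATMENTS = ["", "", "Cont"] + ["IL2_1nM", "IL2_500nM", "IL15_1nM", "IL15_500nM"] * 4


def forward_fill(cells):
    """Repeat each non-empty cell over the empty cells to its right."""
    filled, current = [], ""
    for cell in cells:
        if cell:
            current = cell
        filled.append(current)
    return filled


def flatten_headers(header_rows, sep="/"):
    """Combine several heading rows into one unique name per column."""
    *upper, leaf = header_rows
    # Only the upper levels contain merged cells; the last row is kept as is.
    levels = [forward_fill(row) for row in upper] + [list(leaf)]
    return [sep.join(part for part in column if part) for column in zip(*levels)]


names = flatten_headers([ROW_GROUPS, ROW_TIMES, ROW_TREATMENTS])
print(names[0], names[2], names[3], names[-1])
```

The flattened names are unique, but they are still opaque strings; the meaning of each level remains implicit.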

The first column contains a GenBank identifier for each gene, with the column name "gene_name". The GenBank identifier provides a local identifier for each gene. This local identifier, e.g. “NM_008638”, can be converted to a fully qualified URI by adding a URI prefix, e.g. “http://www.ncbi.nlm.nih.gov/nuccore/NM_008638” allowing the gene to be uniquely and unambiguously identified.

The second column contains the standard symbol for each gene, labelled as "symbol". These appear to be HUGO gene nomenclature symbols, but as there's no mapping it's hard to be sure which namespace these symbols are from.

Requires: URIMapping.
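
The prefix-based mapping described for the gene_name column can be sketched in a few lines of Python. The URI prefix is the one given above; the validation pattern is an assumption that only covers the identifier shape seen in Example 14, and gene_uri is a hypothetical helper name.

```python
import re

NUCCORE_PREFIX = "http://www.ncbi.nlm.nih.gov/nuccore/"

# Assumption: accept only the "NM_" plus digits shape seen in Example 14.
LOCAL_ID = re.compile(r"^NM_\d+$")


def gene_uri(cell):
    """Map a gene_name cell (possibly padded with white space) to a URI."""
    local_id = cell.strip()
    if not LOCAL_ID.match(local_id):
        raise ValueError(f"unexpected identifier: {local_id!r}")
    return NUCCORE_PREFIX + local_id


print(gene_uri("   NM_008638"))
```

No comparable mapping can be written for the symbol column without first knowing which namespace its values belong to.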

As this spreadsheet was published as supplemental data for a journal article, there is little description of what the columns represent, even as text. There is a column labelled as "Cont", which has no description anywhere, but is presumably the background level of expression for each gene.

Requires: SyntacticTypeDefinition and SemanticTypeDefinition.

Half of the cells represent measurements, but the details of what those measurements are can only be found in the article text. The other half of the cells represent the change in expression over the background level. It is difficult to tell the difference without annotation that describes the relationship between the cells (or understanding of the nested headings). In this particular spreadsheet, only the values are published, and not the formulae that were used to calculate the derived values. The units of each cell are "expression levels relative to the expression level of a constant gene, Rpl7", described in the text of the methods section of the full article.

Requires: UnitMeasureDefinition.

The heading rows contain details of the treatment that each cell received, e.g. "4 hour, IL2_1nM". It would be useful to be able to make this machine readable (i.e. to represent treatment with 1nM IL-2 for 4 hours).

All the details of the experiment (which cells were used, how they were treated, when they were measured) are described in the methods section of the article. To be able to compare data between multiple experiments, a parser would also need to be able to understand all these parameters that may have affected the outcome of the experiment.

Requires: AnnotationAndSupplementaryInfo.
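
As an illustration of making the treatment headings machine readable, the sketch below turns a pair of headings such as "4 hour" and "IL2_1nM" into a small structured record. The patterns are inferred from the headings in Example 14 only, and the Treatment type and parse_treatment function are hypothetical names introduced for this example.

```python
import re
from typing import NamedTuple


class Treatment(NamedTuple):
    cytokine: str
    concentration_nm: int
    hours: int


# Patterns inferred from the headings in Example 14, e.g. "IL15_500nM", "24 hour".
TREATMENT = re.compile(r"^IL(\d+)_(\d+)nM$")
DURATION = re.compile(r"^(\d+) hour$")


def parse_treatment(duration_heading, treatment_heading):
    """Turn two heading cells into a structured treatment description."""
    duration = DURATION.match(duration_heading.strip())
    treatment = TREATMENT.match(treatment_heading.strip())
    if duration is None or treatment is None:
        raise ValueError("heading does not follow the expected pattern")
    return Treatment(
        cytokine=f"IL-{treatment.group(1)}",
        concentration_nm=int(treatment.group(2)),
        hours=int(duration.group(1)),
    )


print(parse_treatment("4 hour", "IL2_1nM"))
```

Such guesswork would be unnecessary if the headings were accompanied by annotations stating the cytokine, concentration and duration explicitly.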

2.9 Use Case #9 - Chemical Imaging

(Contributed by Mathew Thomas)

Chemical imaging experimental work makes use of CSV formats to record its measurements. In this use case two examples are shown to depict scans from a mass spectrometer and corresponding FTIR corrected files that are saved into a CSV format automatically.

Mass Spectrometric Imaging (MSI) allows the generation of 2D ion density maps that help visualize molecules present in sections of tissues and cells. The combination of spatial resolution and mass resolution results in very large and complex data sets. The following is generated using the software Decon Tools, a tool to de-isotope MS spectra and to detect features from MS data using isotopic signatures of expected compounds, available freely at omins.pnnl.gov. The raw files generated by the mass spec instrument are read in and the processed output files are saved as CSV files for each line.

Fourier transform infrared (FTIR) spectroscopy is a measurement technique whereby spectra are collected based on measurements of the coherence of a radiative source, using time-domain or space-domain measurements of the electromagnetic radiation or other type of radiation.

In general this use case also illustrates the utility of CSV as a means for scientists to collect and process their experimental results:

The key characteristics are:

Requires: WellFormedCsvCheck, CsvValidation, PrimaryKey and UnitMeasureDefinition.

Lastly, for Mass Spectrometry multiple CSV files need to be examined to view the sample image in its entirety.

Requires: CsvAsSubsetOfLargerDataset.

Below are Mass Spectrometry instrument measurements (3 of 316 CSV rows) for a single line on a sample. It gives the mass-to-charge ranges, peak values, acquisition times and total ion current.

Example 15
1,0,1,4.45E+07,576.27308,1.06E+09,132,0,FTMS + p NSI Full ms [100.00-2000.00]
2,0.075,1,1.26E+08,576.27306,2.32E+09,86,0,FTMS + p NSI Full ms [100.00-2000.00]
3,0.1475,1,9.53E+07,576.27328,1.66E+09,102,0,FTMS + p NSI Full ms [100.00-2000.00]
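
The rows above carry no header, so a script can only convert the cells syntactically. The sketch below parses the first row of Example 15, reading the numeric cells (including those in scientific notation) as floats and extracting the bracketed range from the final cell. Interpreting that bracketed range as the mass-to-charge range is an assumption based on the description above; which numeric column holds which quantity is not stated in the source and is deliberately left open here.

```python
import csv
import re

# First row of Example 15.
LINE = "1,0,1,4.45E+07,576.27308,1.06E+09,132,0,FTMS + p NSI Full ms [100.00-2000.00]"

# Assumption: the bracketed pair in the last cell is the mass-to-charge range.
RANGE = re.compile(r"\[(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)\]")

*numeric_cells, scan_description = next(csv.reader([LINE]))

# float() accepts both plain decimals and scientific notation such as 4.45E+07.
values = [float(cell) for cell in numeric_cells]

match = RANGE.search(scan_description)
mz_range = (float(match.group(1)), float(match.group(2))) if match else None

print(values, mz_range)
```

Without column names and unit definitions, nothing in the file itself says what any of these numbers measure.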

Below is an example of FTIR data. The files from the instrument are baseline corrected, normalized and saved as CSV files automatically. Column 1 represents the wavelength number or range and the remaining columns represent different formations such as bound EPS (extracellular polymeric substance), loose EPS, Shewanella etc. An example (5 of 3161 rows) is given below:

Example 16
,wt beps,wt laeps,so16533 beps,so167333 laeps,so31 beps,so313375 lAPS,so3176345 bEPS,so313376 laEPS,so3193331 bEPS,so3191444 laeps,so3195553beps,so31933333 laeps

2.10 Use Case #10 - OpenSpending Data

(Cbet365tributed by Stasinos Kbet365stantopoulos)

The OpenSpending and the Budgit platforms provide plenty of useful datasets providing figures of natibet365al budget and spending of several countries. A journalist willing to investigate about public spending fallacies can use these data as a basis for his research, and possibly compare them against different sources. Similarly, a politician that is interested in developing new policies for development can, for instance, combine these data with those from the World Bank to identify correlatibet365s and, possibly, dependencies to leverage.

Nevertheless, these uses of these datasets are possibly undermined by the following obstacles.


The datahub.io platform, which collects both OpenSpending and Budgit data, allows publishing data in Simple Data Format (SDF), RDF and other formats that provide explicit semantics. Nevertheless, the datasets mentioned above present implicit semantics and/or additional metadata files provided only as attachments.

2.11 Use Case #11 - City of Palo Alto Tree Data

(Contributed by Eric Stephan)

The City of Palo Alto, California Urban Forest Section is responsible for maintaining and tracking the city's public trees and urban forest. In a W3C Data on the Web Best Practices (DWBP) use case discussion, Jonathan Reichental, City of Palo Alto CIO, brought to the working group's attention a Tree Inventory maintained by the city in spreadsheet form using Google Fusion. This use case shows tabular data representing geophysical tree locations, which are also provided in Google Map form where the user can point and click on trees to look up row information about the tree.

The example below illustrates the first few rows of data:

Example 17
GID,Private,Tree ID,Admin Area,Side of Street,On Street,From Street,To Street,Street_Name,Situs Number,Address Estimated,Lot Side,Serial Number,Tree Site,Species,Trim Cycle,Diameter at Breast Ht,Trunk Count,Height Code,Canopy Width,Trunk Condition,Structure Condition,Crown Condition,Pest Condition,Condition Calced,Condition Rating,Vigor,Cable Presence,Stake Presence,Grow Space,Utility Presence,Distance from Property,Inventory Date,Staff Name,Comments,Zip,City Name,Longitude,Latitude,Protected,Designated,Heritage,Appraised Value,Hardscape,Identifier,Location Feature ID,Install Date,Feature Name,KML,FusionMarkerIcon
1,True,29,,,ADDISON AV,EMERSON ST,RAMONA ST,ADDISON AV,203,,Front,,2,Celtis australis,Large Tree Routine Prune,11,1,25-30,15-30,,Good,5,,,Good,2,False,False,Planting Strip,,44,10/18/2010,BK,,,Palo Alto,-122.1565172,37.4409561,False,False,False,,None,40,13872,,"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl","<Point><coordinates>-122.156485,37.440963</coordinates></Point>",small_green
2,True,30,,,EMERSON ST,CHANNING AV,ADDISON AV,ADDISON AV,203,,Left,,1,Liquidambar styraciflua,Large Tree Routine Prune,11,1,50-55,15-30,Good,Good,5,,,Good,2,False,False,Planting Strip,,21,6/2/2010,BK,,,Palo Alto,-122.1567812,37.440951,False,False,False,,None,41,13872,,"Tree: 30 site 1 at 203 ADDISON AV, on EMERSON ST 21 from pl","<Point><coordinates>-122.156749,37.440958</coordinates></Point>",small_green
3,True,31,,,EMERSON ST,CHANNING AV,ADDISON AV,ADDISON AV,203,,Left,,2,Liquidambar styraciflua,Large Tree Routine Prune,11,1,40-45,15-30,Good,Good,5,,,Good,2,False,False,Planting Strip,,54,6/2/2010,BK,,,Palo Alto,-122.1566921,37.4408948,False,False,False,,Low,42,13872,,"Tree: 31 site 2 at 203 ADDISON AV, on EMERSON ST 54 from pl","<Point><coordinates>-122.156659,37.440902</coordinates></Point>",small_green
4,True,32,,,ADDISON AV,EMERSON ST,RAMONA ST,ADDISON AV,209,,Front,,1,Ulmus parvifolia,Large Tree Routine Prune,18,1,35-40,30-45,Good,Good,5,,,Good,2,False,False,Planting Strip,,21,6/2/2010,BK,,,Palo Alto,-122.1564595,37.4410143,False,False,False,,Medium,43,13873,,"Tree: 32 site 1 at 209 ADDISON AV, on ADDISON AV 21 from pl","<Point><coordinates>-122.156427,37.441022</coordinates></Point>",small_green
5,True,33,,,ADDISON AV,EMERSON ST,RAMONA ST,ADDISON AV,219,,Front,,1,Eriobotrya japonica,Large Tree Routine Prune,7,1,15-20,0-15,Good,Good,3,,,Good,1,False,False,Planting Strip,,16,6/1/2010,BK,,,Palo Alto,-122.1563676,37.441107,False,False,False,,None,44,13874,,"Tree: 33 site 1 at 219 ADDISON AV, on ADDISON AV 16 from pl","<Point><coordinates>-122.156335,37.441114</coordinates></Point>",small_green
6,True,34,,,ADDISON AV,EMERSON ST,RAMONA ST,ADDISON AV,219,,Front,,2,Robinia pseudoacacia,Large Tree Routine Prune,29,1,50-55,30-45,Poor,Poor,5,,,Good,2,False,False,Planting Strip,,33,6/1/2010,BK,cavity or decay; trunk decay; codominant leaders; included bark; large leader or limb decay; previous failure root damage; root decay;  beware of BEES.,,Palo Alto,-122.1563313,37.4411436,False,False,False,,None,45,13874,,"Tree: 34 site 2 at 219 ADDISON AV, on ADDISON AV 33 from pl","<Point><coordinates>-122.156299,37.441151</coordinates></Point>",small_green

The complete CSV file of Palo Alto tree data is available locally - but please note that it is approximately 18MB in size.

Google Fusion allows a user to download the tree data either from a filtered view or as the entire spreadsheet. The exported spreadsheet is in an organized and consistent tabular format. This includes:

In order for information about a given tree to be reconciled with information about the same tree originating from other sources, the local identifier for that tree must be mapped to a globally unique identifier such as a URI.

Also note that in row 6, a series of statements describing the condition of the tree and other important information is provided in the comments cell. These statements are delimited using the semi-colon ";" character.

Requires: WellFormedCsvCheck, CsvValidation, PrimaryKey, URIMapping, MissingValueDefinition, UnitMeasureDefinition, CellMicrosyntax and RepeatedProperties.
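
The semi-colon microsyntax in the comments cell can be illustrated with a minimal Python sketch. It is not part of the published dataset or tooling: the function name `split_comments` and the abbreviated three-column extract of row 6 are assumptions made purely for illustration.

```python
import csv
import io

def split_comments(cell: str, delimiter: str = ";") -> list[str]:
    """Split a comments cell into individual, trimmed statements."""
    return [part.strip() for part in cell.split(delimiter) if part.strip()]

# Hypothetical, abbreviated extract of row 6 (only three of the columns).
sample = (
    "Tree ID,Species,Comments\n"
    "34,Robinia pseudoacacia,cavity or decay; trunk decay; codominant leaders;  beware of BEES.\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))
statements = split_comments(rows[0]["Comments"])
print(statements)
```

A CSV parser alone returns the comments as one string; the second splitting step is what a cell microsyntax declaration would make explicit.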

2.12 Use Case #12 - Chemical Structures

(Contributed by Eric Stephan)

The purpose of this use case is to illustrate how 3-D molecular structures such as the Protein Data Bank and XYZ formats are conveyed in tabular formats. These files may be archived for use in informatics analysis or as part of an input deck for experimental simulation. Scientific communities rely heavily on tabular formats such as these to conduct their research and share each other's results in platform independent formats.

The Protein Data Bank (pdb) file format is a tabular file describing the three dimensional structures of molecules held in the Protein Data Bank. The pdb format accordingly provides for description and annotation of protein and nucleic acid structures including atomic coordinates, observed sidechain rotamers, secondary structure assignments, as well as atomic connectivity.

The XYZ file format is a chemical file format. There is no formal standard and several variations exist, but a typical XYZ file specifies the molecule geometry by giving the number of atoms on the first line, a comment on the second line, and the atomic coordinates (Cartesian) on the following lines, one atom per line.

In general this use case also illustrates the utility of CSV as a means for scientists to collect and process their experimental results:

The key characteristics of the XYZ format are:

Requires: WellFormedCsvCheck, CsvValidation, MultipleHeadingRows and UnitMeasureDefinition.

Below is a Methane molecular structure organized in an XYZ format.

Example 18
5
methane molecule (in angstroms)
C        0.000000        0.000000        0.000000
H        0.000000        0.000000        1.089000
H        1.026719        0.000000       -0.363000
H       -0.513360       -0.889165       -0.363000
H       -0.513360        0.889165       -0.363000
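
Reading such a file can be sketched in Python as follows. This is a minimal illustration that assumes the typical XYZ variant described above (atom count on the first line, comment on the second, then one atom per line); the names `Atom` and `parse_xyz` are hypothetical and do not come from any XYZ library.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    element: str
    x: float
    y: float
    z: float

def parse_xyz(text: str) -> tuple[str, list[Atom]]:
    """Parse XYZ text: atom count, comment line, then one atom per line."""
    lines = text.strip().splitlines()
    count = int(lines[0])          # first line: number of atoms
    comment = lines[1].strip()     # second line: free-text comment
    atoms = []
    for line in lines[2:2 + count]:
        element, x, y, z = line.split()   # whitespace-delimited cells
        atoms.append(Atom(element, float(x), float(y), float(z)))
    if len(atoms) != count:
        raise ValueError(f"expected {count} atoms, found {len(atoms)}")
    return comment, atoms

methane = """5
methane molecule (in angstroms)
C        0.000000        0.000000        0.000000
H        0.000000        0.000000        1.089000
H        1.026719        0.000000       -0.363000
H       -0.513360       -0.889165       -0.363000
H       -0.513360        0.889165       -0.363000
"""

comment, atoms = parse_xyz(methane)
print(comment, len(atoms))
```

Note that the unit of measure (angstroms) is only recoverable from the free-text comment, and that the two leading lines are neither header nor data rows. Both points are what the requirements above are concerned with.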

The key characteristics of the PDB format are:

Requires: GroupingOfMultipleTables.

Below is an example PDB file:

Example 19
HEADER    EXTRACELLULAR MATRIX                    22-JAN-98   1A3I
REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000
REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000
ATOM      1  N   PRO A   1       8.316  21.206  21.530  1.00 17.44           N
ATOM      2  CA  PRO A   1       7.608  20.729  20.336  1.00 17.44           C
ATOM      3  C   PRO A   1       8.487  20.707  19.092  1.00 17.44           C
ATOM      4  O   PRO A   1       9.466  21.457  19.005  1.00 17.44           O
ATOM      5  CB  PRO A   1       6.460  21.723  20.211  1.00 22.26           C
HETATM  130  C   ACY   401       3.682  22.541  11.236  1.00 21.19           C
HETATM  131  O   ACY   401       2.807  23.097  10.553  1.00 21.19           O
HETATM  132  OXT ACY   401       4.306  23.101  12.291  1.00 21.19           O

2.13 Use Case #13 - Representing Entities and Facts Extracted From Text

(Contributed by Tim Finin)

The US National Institute of Standards and Technology (NIST) has run various conferences on extracting information from text centered around challenge problems. Participants submit the output of their systems on an evaluation dataset to NIST for scoring, typically in a tab-separated format.

The 2013 NIST Cold Start Knowledge Base Population Task, for example, asks participants to extract facts from text and to represent these as triples along with associated metadata that include provenance and certainty values. A line in the submission format consists of a triple (subject-predicate-object) and, for some predicates, provenance information. Provenance includes a document ID and, depending on the predicate, one or three pairs of string offsets within the document. For predicates that are relations, an optional second set of provenance values can be provided. Each line can also have an optional float as a final column to represent a certainty measure.

The following lines show examples of possible triples of varying length. In the second line, D00124 is the ID of a document and the strings like 283-286 refer to strings in a document using the offsets of the first and last characters. The final floating point value on some lines is the optional certainty value.

Example 20
:e4 type         PER
:e4 mention      "Bart"  D00124 283-286
:e4 mention      "JoJo"  D00124 145-149 0.9
:e4 per:siblings :e7     D00124 283-286 173-179 274-281
:e4 per:age      "10"    D00124 180-181 173-179 182-191 0.9
:e4 per:parent   :e9     D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210

The submission format does not require that each line have the same number of columns. The expected provenance information for a triple depends on the predicate. For example, “type” typically has no provenance, “mention” has a document ID and offset pair, and domain predicates like “per:age” have one or two provenance records each of which has a document ID and three offset pairs.

The file format exemplified above raises a number of issues, described below. Each row is intended to describe an entity (e.g. the subject of the triple, “:e4”). The unique identifier for that entity is provided in the first column. In order for information about this entity to be reconciled with information from other sources about the same entity, the local identifier needs to be mapped to a globally unique identifier such as a URI.

Requires: URIMapping.

After each triple, there is a variable number of annotations representing the provenance of the triple and, occasionally, its certainty. This information has to be properly identified and managed.

Requires: AnnotationAndSupplementaryInfo.

Entities “:e4”, “:e7” and “:e9” appear to be (foreign key) references to other entities described in this or in external tables. Likewise, the identifiers “D00124” and “D00101” are ambiguous. It would be useful to identify the resources that these references represent.

Moreover, “per” appears to be a term from a controlled vocabulary. How do we know which controlled vocabulary it is a member of and what its authoritative definition is?

Requires: ForeignKeyReferences, AssociationOfCodeValuesWithExternalDefinitions and SemanticTypeDefinition.

The identifiers used for the entities (“:e4”, “:e7” and “:e9”), as well as those used for the predicates (e.g. “type”, “mention”, “per:siblings” etc.), are ambiguous local identifiers. How can one make the identifier an unambiguous URI? A similar requirement regards the provenance annotations. These are composed of a document identifier (e.g. “D00124”) and string offset ranges (e.g. “180-181”). Offset ranges are clearly valid only in the context of the preceding document identifier. The interesting assertion about provenance is the reference (document plus offset range). Thus we might want to give the reference a unique identifier composed of the document ID and offset range (e.g. D00124#180-181).

Requires: URIMapping.

Besides the entities, the table also presents some values. Some of these are strings (e.g. “10”, “Bart”), some of them are probably floating point values (e.g. “0.9”). It would be useful to have an explicit syntactic type definition for these values.

Requires: SyntacticTypeDefinition.

Entity “:e4” is the subject of many rows, meaning that many rows can be combined to make a composite set of statements about this entity.

Moreover, a single row in the table comprises a triple (subject-predicate-object), one or more provenance references and an optional certainty measure. The provenance references have been normalised for compactness (e.g. so they fit on a single row). However, each provenance statement has the same target triple so one could unbundle the composite row into multiple simple statements that have a regular number of columns (see the two equivalent examples below).

Example 21
:e4 per:age      "10"    D00124 180-181 173-179 182-191 0.9
:e4 per:parent   :e9     D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210

Example 22
:e4 per:age      "10"    D00124 180-181 0.9
:e4 per:age      "10"    D00124 173-179 0.9
:e4 per:age      "10"    D00124 182-191 0.9
:e4 per:parent   :e9     D00124 180-181
:e4 per:parent   :e9     D00124 381-380
:e4 per:parent   :e9     D00124 399-406
:e4 per:parent   :e9     D00101 220-225
:e4 per:parent   :e9     D00101 230-233
:e4 per:parent   :e9     D00101 201-210

Requires: TableNormalization.
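
The unbundling of a composite row into regular rows can be sketched as follows. This is an illustrative Python sketch, not NIST tooling: the function name `unbundle` is hypothetical, and the code assumes that cells are whitespace-separated, that objects contain no embedded whitespace, and that a trailing token containing a decimal point is the certainty value.

```python
import re

OFFSETS = re.compile(r"^\d+-\d+$")       # offset pair, e.g. 180-181
CERTAINTY = re.compile(r"^\d*\.\d+$")    # optional trailing float, e.g. 0.9

def unbundle(line: str) -> list[tuple]:
    """Expand one composite row into one row per (document, offset pair)."""
    tokens = line.split()
    subject, predicate, obj, rest = tokens[0], tokens[1], tokens[2], tokens[3:]
    certainty = None
    if rest and CERTAINTY.match(rest[-1]):
        certainty = float(rest.pop())
    rows, document = [], None
    for token in rest:
        if OFFSETS.match(token):
            rows.append((subject, predicate, obj, document, token, certainty))
        else:
            document = token  # a new document ID starts a new provenance group
    # A triple without provenance (e.g. "type") still yields a single row.
    return rows or [(subject, predicate, obj, None, None, certainty)]

age = unbundle(':e4 per:age      "10"    D00124 180-181 173-179 182-191 0.9')
parent = unbundle(":e4 per:parent   :e9     D00124 180-181 381-380 399-406 "
                  "D00101 220-225 230-233 201-210")
for row in age + parent:
    print(row)
```

Every output row has the same six columns, with empty values where a triple carries no provenance or certainty, which is the regular shape that a normalised table needs.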

Lastly, since we have already observed that rows comprise triples, that there is frequent reference to externally defined vocabularies, that values are defined as text (literals), and that triples are also composed of entities for which we aim to obtain a URI (as described above), it may be useful to be able to convert such a table into RDF.

Requires: CsvToRdfTransformation.

2.14 Use Case #14 - Displaying Locations of Care Homes on a Map

(Contributed by Jeni Tennison)

NHS Choices makes available a number of (what it calls) CSV files for different aspects of NHS data on its website at http://www.nhs.uk/aboutnhschoices/contactus/pages/freedom-of-information.aspx

One of the files (file = SCL.csv) contains information about the locations of care homes, as illustrated in the example below:

Example 23
220153?1-303541019?Care homes and care at home?UNKNOWN?Visible?False?Bournville House?Furnace Lane?Lightmoor Village??Telford?Shropshire?TF4 3BY?0?0?1-101653596?Accord Housing Association Limited?01952739284??www.accordha.org.uk?01952588949?
220154?1-378873485?Care homes and care at home?UNKNOWN?Visible?True?Ashcroft?Milestone House?Wicklewood??Wymondham?Norfolk?NR18 9QL?52.577003479003906?1.0523598194122314?1-377665735?Julian Support Limited?01953 607340?ashcroftresidential@juliansupport.org?http://www.juliansupport.org?01953 607365?
220155?1-409848410?Care homes and care at home?UNKNOWN?Visible?False?Quorndon Care Limited?34 Bakewell Road???Loughborough?Leicestershire?LE11 5QY?52.785675048828125?-1.219469428062439?1-101678101?Quorndon Care Limited?01509219024??www.quorndoncare.co.uk?01509413940?

The file has two interesting syntactic features:

Requires: WellFormedCsvCheck, SyntacticTypeDefinition and NonStandardCellDelimiter.
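
A parser can cope with such a file if the delimiter is supplied as a parameter instead of being assumed to be a comma. The following Python sketch uses the standard `csv` module; the helper name `read_delimited` is hypothetical, and the delimiter is passed as `?` only because that is how it appears in the sample above. The sketch assumes the delimiter never occurs inside a cell.

```python
import csv
import io

def read_delimited(text: str, delimiter: str) -> list[list[str]]:
    """Read rows that use a single-character delimiter other than a comma."""
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=delimiter, quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        # Each line in the sample ends with the delimiter,
        # which would otherwise yield a trailing empty cell.
        if row and row[-1] == "":
            row = row[:-1]
        rows.append(row)
    return rows

# Shortened copy of the first sample row (first 15 cells only).
sample = ("220153?1-303541019?Care homes and care at home?UNKNOWN?Visible?False?"
          "Bournville House?Furnace Lane?Lightmoor Village??Telford?Shropshire?"
          "TF4 3BY?0?0?\n")

rows = read_delimited(sample, "?")
print(len(rows[0]), rows[0][12])
```

Without external metadata naming the delimiter, a consumer has to guess it from the data, which is what the NonStandardCellDelimiter requirement aims to avoid.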

Our user wants to be able to embed a map of these locations easily into her web page using a web component, such that she can use markup like:

	<emap src="http://media.nhschoices.nhs.uk/data/foi/SCL.csv" latcol="Latitude" longcol="Longitude">

and see a map similar to that shown at https://github.com/JeniT/nhs-choices/blob/master/SCP.geojson, without converting the CSV file into GeoJSON.

To make the web component easy to define, there should be a native API on to the data in the CSV file within the browser.

Requires: CsvToJsonTransformation.

2.15 Use Case #15 - Intelligently Previewing CSV files

(Contributed by Jeni Tennison)

All of the data repositories based on the CKAN software, such as data.gov.uk, data.gov, and many others, use JSON as the representation of the data when providing a preview of CSV data within a browser. Server side pre-processing of the CSV files is performed to try and determine column types, clean the data and transform the CSV-encoded data to JSON in order to provide the preview. JSON has many features which make it ideal for delivering a preview of the data, originally in CSV format, to the browser.

Javascript is a hard dependency for interacting with data in the browser and as such JSON was used as the serialization format because it was the most appropriate format for delivering those data. As the object notation for Javascript, JSON is natively understood by Javascript; it is therefore possible to use the data without any external dependencies. The values in the data delivered map directly to common Javascript types, and libraries for processing and generating JSON, with appropriate type conversion, are widely available for many programming languages.

Beyond basic knowledge of how to work with JSON, there is no further burden on the user to understand complex semantics around how the data should be interpreted. The user of the data can be assured that the data is correctly encoded as UTF-8 and it is easily queryable using common patterns used in everyday Javascript. None of the encoding and serialization flaws with CSV are apparent, although badly structured CSV files will be mirrored in the JSON.

Requires: WellFormedCsvCheck and CsvToJsonTransformation.

When providing the in-browser previews of CSV-formatted data, the utility of the preview application is limited because the server-side processing of the CSV is not always able to determine the data types (e.g. date-time) associated with data columns. As a result it is not possible for the in-browser preview to offer functions such as sorting rows by date.

As an example, see the Spend over £25,000 in The Royal Wolverhampton Hospitals NHS Trust example. Note that the underlying data begins with:

Example 24
"Expenditure over £25,000- Payment made in January 2014",,,,,,,,
Department Family,Entity,Date,Expense Type,Expense Area,Supplier,Transaction Number,Amount in Sterling,
Department of Health,The Royal Wolverhampton Hospitals NHS Trust RL4,31/01/2014,Capital Project,Capital,STRYKER UK LTD,0001337928,31896.06,
Department of Health,The Royal Wolverhampton Hospitals NHS Trust RL4,17/01/2014,SERVICE AGREEMENTS,Pathology,ABBOTT LABORATORIES LTD,0001335058,77775.13,

A local copy of this dataset is available: file = mth-10-january-2014.csv

The header line here comes below an empty row, and there is metadata about the table in the row above the empty row. The preview code manages to identify the headers from the CSV, and displays the metadata as the value in the first cell of the first row.

Requires: MultipleHeadingRows and AnnotationAndSupplementaryInfo.

It would be good if the preview could recognise that the Date column contains a date and that the Amount in Sterling column contains a number, so that it could offer options to filter/sort these by date/numerically.

Requires: SemanticTypeDefinition, SyntacticTypeDefinition and UnitMeasureDefinition.
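
Once the column types are known, sorting by date or amount becomes straightforward. The Python sketch below hard-codes the knowledge that a preview tool currently has to guess: which row is the header, that Date uses day/month/year order, and that Amount in Sterling is a decimal number. The function name `load_typed` is hypothetical, and the empty second row in the sample is assumed from the description above.

```python
import csv
import io
from datetime import date, datetime
from decimal import Decimal

def load_typed(text: str, header_first_cell: str) -> list[dict]:
    """Skip leading title/blank rows, then parse Date and Amount into typed values."""
    rows = list(csv.reader(io.StringIO(text)))
    start = next(i for i, row in enumerate(rows) if row and row[0] == header_first_cell)
    header = rows[start]
    records = []
    for row in rows[start + 1:]:
        if not any(row):
            continue  # ignore blank rows
        record = dict(zip(header, row))
        record["Date"] = datetime.strptime(record["Date"], "%d/%m/%Y").date()
        record["Amount in Sterling"] = Decimal(record["Amount in Sterling"])
        records.append(record)
    return records

sample = (
    '"Expenditure over £25,000- Payment made in January 2014",,,,,,,,\n'
    ",,,,,,,,\n"
    "Department Family,Entity,Date,Expense Type,Expense Area,Supplier,Transaction Number,Amount in Sterling,\n"
    "Department of Health,The Royal Wolverhampton Hospitals NHS Trust RL4,31/01/2014,Capital Project,Capital,STRYKER UK LTD,0001337928,31896.06,\n"
    "Department of Health,The Royal Wolverhampton Hospitals NHS Trust RL4,17/01/2014,SERVICE AGREEMENTS,Pathology,ABBOTT LABORATORIES LTD,0001335058,77775.13,\n"
)

records = load_typed(sample, "Department Family")
by_date = sorted(records, key=lambda r: r["Date"])
print([r["Supplier"] for r in by_date])
```

The transaction number is deliberately left as a string so that its leading zeros survive. This is another case where the intended type cannot be inferred reliably from the values alone.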

Moreover, some of the values reported may refer to external definitions (from dictionaries or other sources). It would be useful to know where it is possible to find such resources, to be able to properly handle and visualize the data, by linking to them.

Requires: AssociationOfCodeValuesWithExternalDefinitions.

Lastly, the web page where the CSV is published also presents useful metadata about it. It would be useful to be able to know and access these metadata even though they are not included in the file.

These include:

Requires: AnnotationAndSupplementaryInfo.

2.16 Use Case #16 - Tabular Representations of NetCDF data Using CDL Syntax

(Contributed by Eric Stephan)

NetCDF is a set of binary data formats, programming interfaces, and software libraries that help read and write scientific data files. NetCDF provides scientists a means to share measured or simulated experiments with one another across the web. What makes NetCDF useful is its ability to be self describing and provide a means for scientists to rely on an existing data model as opposed to needing to write their own. The classic NetCDF data model consists of variables, dimensions, and attributes. This way of thinking about data was introduced with the very first NetCDF release, and is still the core of all NetCDF files.

Among the tools available to the NetCDF community are two tools: ncdump and ncgen. The ncdump tool is used by scientists wanting to inspect variables and attributes (metadata) contained in the NetCDF file. It can also provide a full text extraction of data, including blocks of tabular data represented by variables. While NetCDF files are typically written by a software client, it is possible to generate NetCDF files using ncgen and ncgen3 from a text format. The ncgen tool parses the text file and stores it in a binary format.

Both ncdump and ncgen rely on a text format to represent the NetCDF file called network Common Data form Language (CDL). The CDL syntax as shown below contains annotation along with blocks of data denoted by the "data:" key. For the results to be legible for visual inspection, the measurement data is written as delimited blocks of scalar values. As shown in the example below, CDL supports multiple variables or blocks of data. The blocks of data, while delimited, need to be thought of as a vector or single column of tabular data wrapped around to the next line, in a similar way that characters can be wrapped around in a single cell block of a spreadsheet to make the spreadsheet more visually appealing to the user.

Example 25
netcdf foo {    // example NetCDF specification in CDL

dimensions:
lat = 10, lon = 5, time = unlimited;

variables:
  int     lat(lat), lon(lon), time(time);
  float   z(time,lat,lon), t(time,lat,lon);
  double  p(time,lat,lon);
  int     rh(time,lat,lon);

  lat:units = "degrees_north";
  lon:units = "degrees_east";
  time:units = "seconds";
  z:units = "meters";
  z:valid_range = 0., 5000.;
  p:_FillValue = -9999.;
  rh:_FillValue = -1;

data:
  lat   = 0, 10, 20, 30, 40, 50, 60, 70, 80, 90;
  lon   = -140, -118, -96, -84, -52;
}

The next example shows a small subset of a data block taken from an actual NetCDF file. As before, each block of values should be read as a single column of tabular data that has been wrapped across lines.

Example 26

 base_time = 1020770640 ;

 time_offset = 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32,
    34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68,
    70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100, 102,
    104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130,
    132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154, 156, 158,
    160, 162, 164, 166, 168, 170, 172, 174, 176, 178, 180, 182, 184, 186,
    188, 190, 192, 194, 196, 198, 200, 202, 204, 206, 208, 210, 212, 214,
    216, 218, 220, 222, 224, 226, 228, 230, 232, 234, 236, 238, 240, 242,
    244, 246, 248, 250, 252, 254, 256, 258, 260, 262, 264, 266, 268, 270,
    272, 274, 276, 278, 280, 282, 284, 286, 288, 290, 292, 294, 296, 298,
    300, 302, 304, 306, 308, 310, 312, 314, 316, 318, 320, 322, 324, 326,
    328, 330, 332, 334, 336, 338, 340, 342, 344, 346, 348, 350, 352, 354,
    356, 358, 360, 362, 364, 366, 368, 370, 372, 374, 376, 378, 380, 382,
    384, 386, 388, 390, 392, 394, 396, 398, 400, 402, 404, 406, 408, 410,
    412, 414, 416, 418, 420, 422, 424, 426, 428, 430, 432, 434, 436, 438,
    440, 442, 444, 446, 448, 450, 452, 454, 456, 458, 460, 462, 464, 466,
    468, 470, 472, 474, 476, 478, 480, 482, 484, 486, 488, 490, 492, 494,
    496, 498, 500, 502, 504, 506, 508, 510, 512, 514, 516, 518, 520, 522;

The format allows for error codes and missing values to be included.

Requires: WellFormedCsvCheck, CsvValidation, UnitMeasureDefinition, MissingValueDefinition and GroupingOfMultipleTables.
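
Reading a wrapped block as a single column can be sketched in Python as follows. This is a simplified illustration, not a CDL parser: the function name `parse_cdl_vectors` is hypothetical, and the code assumes its input is only the data section, containing numeric `name = v1, v2, ... ;` statements without comments or string values.

```python
import re

def parse_cdl_vectors(cdl_data: str) -> dict[str, list[float]]:
    """Collect each `name = v1, v2, ... ;` statement into one column of numbers."""
    vectors = {}
    # A statement runs from the variable name to the terminating semicolon,
    # possibly wrapped across several lines.
    for name, body in re.findall(r"(\w+)\s*=\s*([^;]*);", cdl_data):
        vectors[name] = [float(value) for value in body.split(",") if value.strip()]
    return vectors

# Shortened copy of the data block shown in the example above.
sample = """
 base_time = 1020770640 ;

 time_offset = 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32,
    34, 36, 38, 40 ;
"""

vectors = parse_cdl_vectors(sample)
print(len(vectors["time_offset"]), vectors["time_offset"][-1])
```

Line breaks inside a statement carry no meaning here; only the commas and the closing semicolon delimit values. Missing values would appear as whatever fill value the variable declares (for example `-9999.`) and need to be interpreted using that annotation.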

Lastly, NetCDF files are typically collected together in larger datasets where they can be analyzed, so the CSV data can be thought of as a subset of a larger dataset.

Requires: CsvAsSubsetOfLargerDataset and AnnotationAndSupplementaryInfo.

2.17 Use Case #17 - Canonical mapping of CSV

(Contributed by David Booth and Jeremy Tandy)

CSV is by far the commonest format within which open data is published, and is thus typical of the data that application developers need to work with.

However, an object / object graph serialisation (of open data) is easier to consume within software applications. For example, web applications (using HTML5 & Javascript) require no extra libraries to work with data in JSON format. Similarly, RDF-encoded data from multiple sources can be simply combined or merged using SPARQL queries once persisted within a triple store.

The UK Government policy paper "Open Data: unleashing the potential" outlines a set of principles for publishing open data. Within this document, principle 9 states:

Release data quickly, and then work to make sure that it is available in open standard formats, including linked data formats.

The open data principles recognise that the additional utility to be gained from publishing in linked data formats must be balanced against the additional effort incurred by the data publisher to do so and the resulting delay to publication of the data. Data publishers are required to release data quickly - which means making the data available in a format convenient for them such as CSV dumps from databases or spreadsheets.

One of the hindrances to publishing in linked data formats is the difficulty in determining the ontology or vocabulary (e.g. the classes, predicates, namespaces and other usage patterns) that should be used to describe the data. Whilst it is only reasonable to assume that a data publisher best knows the intended meaning of their data, they cannot be expected to determine the ontology or vocabulary most applicable to a consuming application!

Furthermore, in lieu of agreed de facto standard vocabularies or ontologies for a given application domain, it is highly likely that disparate applications will conform to different data models. How should the data publisher choose which of the available vocabularies or ontologies to use when publishing (if indeed they are aware of those applications at all)?

In order to help data publishers provide data in linked data formats without the need to determine ontologies or vocabularies, it is necessary to separate the syntactic mapping (e.g. changing format from CSV to JSON) from the semantic mapping (e.g. defining the transformations required to achieve semantic alignment with a target data model).

As a result of such separation, it will be possible to establish a canonical transformation from CSV conforming to the core tabular data model [tabular-data-model] to an object graph serialisation such as JSON.

Requires: WellFormedCsvCheck, CsvToJsonTransformation and CanonicalMappingInLieuOfAnnotation.
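
A purely syntactic mapping of this kind can be sketched in a few lines of Python. This is only an illustration of the idea and is not the canonical mapping defined by the Working Group: the function name `csv_to_json` is hypothetical, the sample is an abbreviated extract of the tree data from Use Case #11, and every value is deliberately left as a string because no type or semantic annotation is assumed.

```python
import csv
import io
import json

def csv_to_json(text: str) -> str:
    """Map each data row to a JSON object keyed by the header row; values stay strings."""
    reader = csv.DictReader(io.StringIO(text))
    return json.dumps(list(reader), indent=2)

# Abbreviated, hypothetical extract with three columns.
sample = (
    "Tree ID,Species,Trunk Count\n"
    "29,Celtis australis,1\n"
    "30,Liquidambar styraciflua,1\n"
)

print(csv_to_json(sample))
```

No vocabulary choice is needed to produce this output. Semantic alignment with a target data model can be layered on afterwards by whoever needs it.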

This use case assumes that JSON is the target serialisation for application developers given the general utility of that format. However, by considering JSON-LD [json-ld], it becomes trivial to map CSV-encoded tabular data via JSON into a canonical RDF model. In doing so this enables CSV-encoded tabular data to be published in linked data formats as required in the open data principle 9 at no extra effort to the data publisher as standard mechanisms are available for a data user to transform the data from CSV to RDF.

Requires: CsvToRdfTransformation.

In addition, open data principle 14 requires that:

Public bodies should publish relevant metadata about their datasets […]; and they should publish supporting descriptions of the format, provenance and meaning of the data.

To achieve this, data publishers need to be able to publish supplementary metadata concerning their tabular datasets, such as title, usage license and description.

Requires: AnnotationAndSupplementaryInfo.

Applications may automatically determine the data type (e.g. date-time, number) associated with cells in a CSV file by parsing the data values. However, on occasion, this is prone to mistakes where data appears to resemble something else. This is especially prevalent for dates. For example, 1/4 is often read as 1 April rather than 0.25. In such situations, it is beneficial if guidance can be given to the transformation process indicating the data type for given columns.

Requires: SyntacticTypeDefinition.

Provision of CSV data coupled with a canonical mapping provides significant utility by itself. However, there is nothing stopping a data publisher from adding annotation defining data semantics once, say, an appropriate de facto standard vocabulary has been agreed within the community of use. Similarly, a data consumer may wish to work directly with the canonical mapping and ignore any semantic annotations provided by the publisher.

2.18 Use Case #18 - Supporting Semantic-based Recommendations

(Contributed by Davide Ceolin and Valentina Maccatrozzo)

In the ESWC-14 Challenge: Linked Open Data-enabled Recommender Systems, participants are provided with a series of datasets about books in TSV format.

A first dataset contains a set of user identifiers and their ratings for a number of books each. Each book is represented by means of a numeric identifier.

Example 27
DBbook_userID,	DBbook_itemID,	rate
6873,		5950,		1
6873,		8010,		1
6873,		5232,		1

Ratings can be boolean (0,1) or Likert scale values (from 1 to 5), depending on the challenge task considered.

Requires: SyntacticTypeDefinition, SemanticTypeDefinition and NonStandardCellDelimiter.

A second file provides a mapping between book ids and their names and DBpedia URIs:

Example 28
DBbook_ItemID	name				DBpedia_uri
1		Dragonfly in Amber		http://dbpedia.org/resource/Dragonfly_in_Amber
10		Unicorn Variations		http://dbpedia.org/resource/Unicorn_Variations
100		A Stranger in the Mirror	http://dbpedia.org/resource/A_Stranger_in_the_Mirror
1000		At All Costs			http://dbpedia.org/resource/At_All_Costs

Requires: ForeignKeyReferences.

Participants are requested to estimate the ratings or relevance scores (depending on the task) that users would attribute to a set of books reported in an evaluation dataset:

Example 29
DBbook_userID	DBbook_itemID
6873		5946
6873		5229
6873		3151

Requires: AssociationOfCodeValuesWithExternalDefinitions.

The challenge mandates the use of Linked Open Data resources in the recommendations.

An effective manner to satisfy this requirement is to make use of undirected semantic paths. An undirected semantic path is a sequence of entities (subject or object) and properties that link two items, for instance:

	{Book1 property1 Object1 property2 Book2}

This sequence results from considering the triples (subject-predicate-object) in a given Linked Open Data resource (e.g. DBpedia), independently of their direction, such that the starting and the ending entities are the desired items and that the subject (or object) of a triple is the object (or subject) of the following triple. For example, the sequence above may result from the following triples:

	Book1 property1 Object1
	Book2 property1 Object1

Undirected semantic paths are classified according to their length. For a given length, one can extract all the undirected semantic paths of that length that link two items within a Linked Open Data resource by running a set of SPARQL queries. A set of queries is necessary because the source stores data as directed triples (subject-predicate-object), so an undirected semantic path actually corresponds to the union of a set of directed semantic paths.

The number of queries needed to obtain all the undirected semantic paths that link two items is exponential in the length of the path (2^n for a path of length n). Because of the complexity of this task and the latency it may incur, it might be useful to cache these results.
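The growth in the number of queries can be sketched in Python. The following is only an illustration, assuming that the length of a path is counted in triples and that each triple may be traversed in either direction; the function name and the variable naming scheme (?x1, ?p1, ...) are hypothetical and are not taken from the challenge.

```python
from itertools import product


def directed_patterns(n_triples, start="?book1", end="?book2"):
    """Return one SPARQL graph pattern per combination of triple directions.

    Assumes the path length is counted in triples. Intermediate nodes are
    named ?x1, ?x2, ... and properties ?p1, ?p2, ... (hypothetical naming).
    """
    nodes = [start] + [f"?x{i}" for i in range(1, n_triples)] + [end]
    patterns = []
    for directions in product((True, False), repeat=n_triples):
        triples = []
        for i, forward in enumerate(directions):
            left, right, prop = nodes[i], nodes[i + 1], f"?p{i + 1}"
            # A forward triple reads left -> right; a reversed one right -> left.
            subject, obj = (left, right) if forward else (right, left)
            triples.append(f"{subject} {prop} {obj} .")
        patterns.append(" ".join(triples))
    return patterns


if __name__ == "__main__":
    for pattern in directed_patterns(2):
        print(pattern)
```

For a path of two triples this yields four patterns, one of which has the same shape as the pair of triples shown above (both books pointing at a shared object).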

CSV is a good candidate for caching undirected semantic paths because of its ease of use, sharing and reuse. However, there are some open issues related to this. First, since paths may have a variable number of components, one might want to represent a path in a single cell while still being able to separate the path elements when necessary.

For example, in this file, undirected semantic paths are grouped by means of double quotes, and path components are separated by commas. The starting and ending elements of the undirected semantic paths (Book1 and Book2) are represented in two separate columns by means of the book identifiers used in the challenge (see the example below).

Example 30
Book1	Book2	Path
1	7680	"http://dbpedia.org/ontology/language,http://dbpedia.org/resource/English_language,http://dbpedia.org/ontology/language"
1	2	"http://dbpedia.org/ontology/author,http://dbpedia.org/resource/Diana_Gabaldon,http://dbpedia.org/ontology/author"
1	2	"http://dbpedia.org/ontology/country,http://dbpedia.org/resource/United_States,http://dbpedia.org/ontology/country"

Requires: CellMicrosyntax and RepeatedProperties.
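In the absence of a declared cell microsyntax, a consumer has to know out of band that the Path cell is comma-separated. A minimal Python sketch of such a reader follows; it assumes a tab-delimited file with a header row and double-quoted Path cells as in the example above, and the function name is hypothetical.

```python
import csv
import io

SAMPLE = (
    "Book1\tBook2\tPath\n"
    '1\t7680\t"http://dbpedia.org/ontology/language,'
    "http://dbpedia.org/resource/English_language,"
    'http://dbpedia.org/ontology/language"\n'
)


def read_paths(text):
    """Yield (book1, book2, components) for each cached path row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quotechar='"')
    for row in reader:
        # The Path cell uses a comma as its own, cell-level separator.
        yield int(row["Book1"]), int(row["Book2"]), row["Path"].split(",")


if __name__ == "__main__":
    for book1, book2, components in read_paths(SAMPLE):
        print(book1, book2, len(components))
```

The split on "," is exactly the piece of knowledge that a cell microsyntax annotation would make explicit.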

Second, the size of these caching files may be considerable. For example, the file described above is ~2GB, which may imply prohibitive loading times, especially when making only a limited number of recommendations.

Since rows are sorted according to the starting and ending book of the undirected semantic path, all the undirected semantic paths that link two books are present in a region of the table formed by consecutive rows.

Given an annotation of such regions indicating which books they describe, one might be able to select only the "slice" of the file needed to make a recommendation, without having to load the file entirely.

Requires: AnnotationAndSupplementaryInfo and RandomAccess.
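One way to picture such an annotation is as an index of byte ranges, one per pair of books. The Python sketch below is an assumption about how this could be done, not a description of any existing tool: it assumes a tab-delimited file with a single header line and rows already sorted so that the rows for a given pair are consecutive. Building the index still requires one pass over the file; the point is that the resulting index could be published alongside the data so that consumers can seek directly to the region they need.

```python
import io


def build_index(binary_file):
    """Map (book1, book2) to the (offset, length) in bytes of its row region.

    Assumes a tab-delimited file with one header line whose rows are sorted
    so that all rows for a given pair of books are consecutive.
    """
    index = {}
    binary_file.seek(0)
    binary_file.readline()  # skip the header line
    offset = binary_file.tell()
    for line in iter(binary_file.readline, b""):
        book1, book2 = line.split(b"\t", 2)[:2]
        key = (int(book1), int(book2))
        start, length = index.get(key, (offset, 0))
        index[key] = (start, length + len(line))
        offset += len(line)
    return index


def read_slice(binary_file, index, key):
    """Read only the rows for one pair of books, without loading the rest."""
    start, length = index[key]
    binary_file.seek(start)
    return binary_file.read(length).decode("utf-8").splitlines()


if __name__ == "__main__":
    data = io.BytesIO(b"Book1\tBook2\tPath\n1\t2\ta\n1\t2\tb\n1\t7680\tc\n")
    index = build_index(data)
    print(read_slice(data, index, (1, 2)))
```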

2.19 Use Case #19 - Supporting Right to Left (RTL) Directionality

(Contributed by Yakov Shafranovich)

Writing systems affect the way in which information is displayed. In some cases, these writing systems affect the order in which characters are displayed. Latin-based languages display text left-to-right across a page (LTR). Languages such as Arabic and Hebrew are written in scripts whose dominant direction is right-to-left (RTL) when displayed; however, text that includes non-native words or numbers is actually bidirectional.

Irrespective of the LTR or RTL display of characters in a given language, data is serialised such that the bytes are ordered in one sequential order.

Content published in Hebrew and Arabic provides examples of RTL display behaviour.


Tabular data originating from countries where vertical writing is the norm (e.g. China, Japan) appears to be published with rows and columns as defined in [RFC4180] (e.g. each horizontal line in the data file conveys a row of data, with the first line optionally providing a header with column names). Rows are published in the left to right topology.

The results from the Egyptian Referendum of 2012 illustrate the problem, as can be seen in Fig. 2 Snippet of web page displaying Egyptian Referendum results (2012).

Fig. 2 Snippet of web page displaying Egyptian Referendum results (2012)

The content in the CSV data file is serialised in the order illustrated below (assuming LTR rendering):

Example 31

?????????????????,????????? ???????????,????????? ??????? ???????????,??????? ?????????????????,??????????????? ???????????????,??????????????? ???????????????,????????? ?????????????????,???????????,??????? ???????????

A copy of the referendum results data file is also available locally.


Readers should be aware that both the right-to-left text direction and the cursive nature of Arabic text have been explicitly overridden in the example above in order to display each individual character in sequential left-to-right order.

The directionality of the content as displayed does not affect the logical structure of the tabular data; i.e. the cell at index zero is followed by the cell at index 1, and then index 2 etc.

However, without awareness of the directionality of the content, an application may display data in a way that is unintuitive for an RTL reader. For example, viewing the CSV file using Libre Office Calc (tested using version 3 configured with English (UK) locale) demonstrates the challenge in rendering the content correctly. Fig. 3 CSV data file containing Egyptian Referendum results (2012) displayed in Libre Office Calc shows how the content is incorrectly rendered; cells progress from left-to-right yet, on the positive side, the Arabic text within a given field runs from right-to-left. Similar behaviour is observed in Microsoft Office Excel 2007.

Fig. 3 CSV data file containing Egyptian Referendum results (2012) displayed in Libre Office Calc

By contrast, Fig. 4 CSV data file containing Egyptian Referendum results (2012) displayed in TextWrangler shows the same file in a simple text editor. TextWrangler is not aware that the overall direction is right-to-left, but does apply the Unicode bidirectional algorithm such that lines starting with an Arabic character have a base direction of right-to-left. However, as a result, the numeric digits are also displayed right to left, which is incorrect.

Fig. 4 CSV data file containing Egyptian Referendum results (2012) displayed in TextWrangler

It is clear that a mechanism needs to be provided such that one can explicitly declare the directionality which applies when parsing and rendering the content of CSV files.


From Unicode version 6.3 onwards, the Unicode Standard contains new control codes (RLI, LRI, FSI, PDI) to enable authors to express isolation at the same time as direction in inline bidirectional text. The Unicode Consortium recommends that isolation be used as the default for all future inline bidirectional text embeddings. To use these new control codes, however, it will be necessary to wait until the browsers support them. The new control codes are:

  • RLI (RIGHT-TO-LEFT ISOLATE) U+2067 to set direction right-to-left
  • LRI (LEFT-TO-RIGHT ISOLATE) U+2066 to set direction left-to-right
  • FSI (FIRST STRONG ISOLATE) U+2068 to set direction according to the first strong character
  • PDI (POP DIRECTIONAL ISOLATE) U+2069 to terminate the range set by RLI, LRI or FSI

More information on setting the directionality of text without markup can be found here.
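As a small illustration of how the isolate control codes listed above might be applied to the text of a single cell, the following Python sketch wraps a string in an opening isolate and a closing PDI, and removes the controls again before comparing values. The function names and the "rtl"/"ltr"/"auto" keywords are hypothetical. The sketch only affects inline text within a cell; it does not declare the direction of the table as a whole, and whether a consuming application honours the characters is outside its scope.

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

_OPENERS = {"rtl": RLI, "ltr": LRI, "auto": FSI}


def isolate(text, direction="auto"):
    """Wrap text in a directional isolate terminated by PDI."""
    return f"{_OPENERS[direction]}{text}{PDI}"


def strip_isolates(text):
    """Remove isolate controls, e.g. before comparing or parsing cell values."""
    return text.translate({ord(c): None for c in (RLI, LRI, FSI, PDI)})


if __name__ == "__main__":
    cell = isolate("12345", "ltr")
    print([hex(ord(c)) for c in cell])
```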

Requires: RightToLeftCsvDeclaration.

2.20 Use Case #20 - Integrating components with the TIBCO Spotfire platform using tabular data

(Contributed by Yakov Shafranovich)

A systems integrator seeks to integrate a new component into the TIBCO Spotfire analytics platform. Reviewing the documentation that describes how to extend the platform indicates that Spotfire employs a common tabular file format for all products: the Spotfire Text Data Format (STDF).

The example from the STDF documentation (below) illustrates a number of the key differences from the standard CSV format defined in [RFC4180].

Example 32
<bom>\! filetype=Spotfire.DataFormat.Text; version=1.0;
\* ich bin ein berliner
Column A;Column #14B;Kolonn ?;The n:th column;
-123.45;i think there\r\nshall never be;\#aaXzD;2004-06-18;
1.0E-14;a poem\r\nlovely as a tree;\#ADB12=;\?lost in time;
222.2;\?invalid text;\?;2004-06-19;
\?error11;\\f?rst?r ej\\;\#aXzCV==;\?1979;
3.14;hej ? h?\seller?;\?NIL;\?#ERROR;

Although not shown in this example, STDF also supports list types:

Requires: CellMicrosyntax.

2.21 Use Case #21 - Publication of Biodiversity Information from GBIF using the Darwin Core Archive Standard

(Contributed by Tim Robertson, GBIF, and Jeremy Tandy)

A citizen scientist investigating biodiversity in the Parque Nacional de Sierra Nevada, Spain, aims to create a compelling web application that combines biodiversity information with other environmental factors - displaying this information on a map and as summary statistics.

The Global Biodiversity Information Facility (GBIF), a government funded open data initiative that spans over 600 institutions worldwide, has mobilised more than 435 million records describing the occurrence of flora and fauna.

Included in their data holdings is "Sinfonevada: Dataset of Floristic diversity in Sierra Nevada forest (SE Spain)", containing around 8000 records belonging to 270 taxa collected between January 2004 and December 2005.

As with the majority of datasets published via GBIF, the Sinfonevada dataset is available in the Darwin Core Archive format (DwC-A).

In accordance with the DwC-A specification, the Sinfonevada dataset is packaged as a zip file containing:

The metadata file included in the zip package must always be named meta.xml, whilst the tabular data file and supplementary metadata are explicitly identified within the main metadata file.

A copy of the zip package is provided for reference. Snippets of the tab delimited tabular data file and the full metadata file "meta.xml" are provided below.

Example 33

id	modified	institutionCode	collectionCode	basisOfRecord	catalogNumber	eventDate	fieldNumber	continent	countryCode	stateProvince	county	locality	minimumElevationInMeters	maximumElevationInMeters	decimalLatitude	decimalLongitude	coordinateUncertaintyInMeters	scientificName	kingdom	phylum	class	order	family	genus	specificEpithet	infraspecificEpithet	scientificNameAuthorship
OBSNEV:SINFONEVADA:SINFON-100-005717-20040930	2013-06-20T11:18:18	OBSNEV	SINFONEVADA	HumanObservation	SINFON-100-005717-20040930	2004-09-30 & 2004-09-30		Europe	ESP	GR	ALDEIRE		1992	1992	37.12724018	-3.116135071	1	Pinus sylvestris Lour.	Plantae	Pinophyta	Pinopsida	Pinales	Pinaceae	Pinus	sylvestris		Lour.
OBSNEV:SINFONEVADA:SINFON-100-005966-20040930	2013-06-20T11:18:18	OBSNEV	SINFONEVADA	HumanObservation	SINFON-100-005966-20040930	2004-09-30 & 2004-09-30		Europe	ESP	GR	ALDEIRE		1992	1992	37.12724018	-3.116135071	1	Berberis hispanica Boiss. & Reut.	Plantae	Magnoliophyta	Magnoliopsida	Ranunculales	Berberidaceae	Berberis	hispanica		Boiss. & Reut.
OBSNEV:SINFONEVADA:SINFON-100-008211-20040930	2013-06-20T11:18:18	OBSNEV	SINFONEVADA	HumanObservation	SINFON-100-008211-20040930	2004-09-30 & 2004-09-30		Europe	ESP	GR	ALDEIRE		1992	1992	37.12724018	-3.116135071	1	Genista versicolor Boiss. ex Steud.	Plantae	Magnoliophyta	Magnoliopsida	Fabales	Fabaceae	Genista	versicolor		Boiss. ex Steud.

The key differences between this tabular data file and RFC 4180 are the use of TAB %x09 as the cell delimiter and LF %x0A as the row terminator.

Also note the use of two adjacent TAB characters to indicate an empty cell.

Example 34

<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="utf-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <id index="0" />
    <field index="1" term="http://purl.org/dc/terms/modified"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/institutionCode"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/collectionCode"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
    <field index="7" term="http://rs.tdwg.org/dwc/terms/fieldNumber"/>
    <field index="8" term="http://rs.tdwg.org/dwc/terms/continent"/>
    <field index="9" term="http://rs.tdwg.org/dwc/terms/countryCode"/>
    <field index="10" term="http://rs.tdwg.org/dwc/terms/stateProvince"/>
    <field index="11" term="http://rs.tdwg.org/dwc/terms/county"/>
    <field index="12" term="http://rs.tdwg.org/dwc/terms/locality"/>
    <field index="13" term="http://rs.tdwg.org/dwc/terms/minimumElevationInMeters"/>
    <field index="14" term="http://rs.tdwg.org/dwc/terms/maximumElevationInMeters"/>
    <field index="15" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="16" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
    <field index="17" term="http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters"/>
    <field index="18" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="19" term="http://rs.tdwg.org/dwc/terms/kingdom"/>
    <field index="20" term="http://rs.tdwg.org/dwc/terms/phylum"/>
    <field index="21" term="http://rs.tdwg.org/dwc/terms/class"/>
    <field index="22" term="http://rs.tdwg.org/dwc/terms/order"/>
    <field index="23" term="http://rs.tdwg.org/dwc/terms/family"/>
    <field index="24" term="http://rs.tdwg.org/dwc/terms/genus"/>
    <field index="25" term="http://rs.tdwg.org/dwc/terms/specificEpithet"/>
    <field index="26" term="http://rs.tdwg.org/dwc/terms/infraspecificEpithet"/>
    <field index="27" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship"/>
  </core>
</archive>

The metadata file specifies:

Requires: NonStandardCellDelimiter, ZeroEditAdditionOfSupplementaryMetadata and AnnotationAndSupplementaryInfo.

The ignoreHeaderLines attribute can be used to skip column headings or preamble comments at the start of a file.
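To make the role of the metadata file concrete, the following Python sketch reads the attributes of the core element and the index-to-term mapping from a shortened copy of meta.xml using the standard library XML parser. It is a sketch under stated assumptions: the function name and the keys of the returned dictionary are hypothetical, only the "\t" and "\n" escapes are translated, and the full DwC-A specification defines more options than are handled here.

```python
import xml.etree.ElementTree as ET

NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}

# A shortened copy of the metadata file shown above (first three columns only).
META = r"""<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="utf-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <id index="0" />
    <field index="1" term="http://purl.org/dc/terms/modified"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/institutionCode"/>
  </core>
</archive>"""


def unescape(value):
    """Translate the two escape sequences used in this example."""
    return value.replace("\\t", "\t").replace("\\n", "\n")


def describe_core(meta_xml):
    """Extract the parsing hints and column terms for the core data file."""
    core = ET.fromstring(meta_xml).find("dwc:core", NS)
    return {
        "row_type": core.get("rowType"),
        "delimiter": unescape(core.get("fieldsTerminatedBy")),
        "line_terminator": unescape(core.get("linesTerminatedBy")),
        "skip_lines": int(core.get("ignoreHeaderLines", "0")),
        "id_column": int(core.find("dwc:id", NS).get("index")),
        "columns": {
            int(field.get("index")): field.get("term")
            for field in core.findall("dwc:field", NS)
        },
    }


if __name__ == "__main__":
    print(describe_core(META))
```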

In this particular case, the tabular data file is packaged within the zip file, and is referenced locally. However, the DwC-A specification also supports annotation of remote tabular data files, and thus does not require any modification of the source data files themselves.

Requires: LinkFromMetadataToData and IndependentMetadataPublication.

Although not present in this example, DwC-A also supports the ability to specify a property-value pair that is applied to every row in the tabular data file, or, in the case of sparse data, for that property-value pair to be added where the property is absent from the data file (e.g. providing a default value for a property).

Requires: SpecificationOfPropertyValuePairForEachRow.

Future releases of DwC-A also seek to provide stronger typing of data formats; at present only date formats are validated.

Requires: SyntacticTypeDefinition.

Whilst the DwC-A format is embedded in many software platforms, including web based tools, none of these seem to fit the needs of the citizen scientist. They want to use existing JavaScript libraries such as Leaflet, an open-source JavaScript library for interactive maps, where possible to simplify their web development effort.

Leaflet has good support for GeoJSON, a JSON format for encoding a variety of geographic data structures.

In the absence of standard tooling, the citizen scientist needs to write a custom parser to convert the tab delimited data into GeoJSON. An example GeoJSON object resulting from this transformation is provided below.

Example 35
{
    "type": "Feature",
    "id": "OBSNEV:SINFONEVADA:SINFON-100-005717-20040930",
    "properties": {
        "modified": "2013-06-20T11:18:18",
        "institutionCode": "OBSNEV",
        "collectionCode": "SINFONEVADA",
        "basisOfRecord": "HumanObservation",
        "catalogNumber": "SINFON-100-005717-20040930",
        "eventDate": "2004-09-30 & 2004-09-30",
        "fieldNumber": "",
        "continent": "Europe",
        "countryCode": "ESP",
        "stateProvince": "GR",
        "county": "ALDEIRE",
        "locality": "",
        "minimumElevationInMeters": "1992",
        "maximumElevationInMeters": "1992",
        "coordinateUncertaintyInMeters": "1",
        "scientificName": "Pinus sylvestris Lour.",
        "kingdom": "Plantae",
        "phylum": "Pinophyta",
        "class": "Pinopsida",
        "order": "Pinales",
        "family": "Pinaceae",
        "genus": "Pinus",
        "specificEpithet": "sylvestris",
        "infraspecificEpithet": "",
        "scientificNameAuthorship": "Lour."
    },
    "geometry": {
        "type": "Point",
        "coordinates": [-3.116135071, 37.12724018, 1992]
    }
}

GeoJSON coordinates are specified in order of longitude, latitude and, optionally, altitude.
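The kind of custom parser the citizen scientist has to write might look like the following Python sketch. It is an illustration only: the function name is hypothetical, the sample uses a reduced set of columns, and the use of minimumElevationInMeters as the altitude is an assumption made to match the example above (where minimum and maximum elevation coincide).

```python
import json

# Reduced header and row (a subset of the columns in the data file above).
HEADER = "id\tfieldNumber\tdecimalLatitude\tdecimalLongitude\tminimumElevationInMeters\tscientificName"
ROW = "OBSNEV:SINFONEVADA:SINFON-100-005717-20040930\t\t37.12724018\t-3.116135071\t1992\tPinus sylvestris Lour."


def row_to_feature(header_line, data_line):
    """Convert one tab-delimited occurrence row into a GeoJSON Feature."""
    names = header_line.rstrip("\r\n").split("\t")
    cells = data_line.rstrip("\r\n").split("\t")  # adjacent TABs give empty cells
    record = dict(zip(names, cells))
    # GeoJSON order is longitude, latitude and, optionally, altitude.
    coordinates = [
        float(record.pop("decimalLongitude")),
        float(record.pop("decimalLatitude")),
    ]
    elevation = record.get("minimumElevationInMeters", "")
    if elevation:  # assumption: minimum elevation stands in for altitude
        coordinates.append(float(elevation))
    return {
        "type": "Feature",
        "id": record.pop("id"),
        "properties": record,
        "geometry": {"type": "Point", "coordinates": coordinates},
    }


if __name__ == "__main__":
    print(json.dumps(row_to_feature(HEADER, ROW), indent=4))
```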

Requires: CsvToJsonTransformation.

The citizen scientist notes that many of the terms in a given row are drawn from controlled vocabularies; geographic names and taxonomies. For the application, they want to be able to refer to the authoritative definitions for these controlled vocabularies, say, to provide easy access for users of the application to the definitions of scientific terms such as "Pinophyta".

Requires: AssociationOfCodeValuesWithExternalDefinitions.

Thinking to the future of their application, our citizen scientist anticipates the need to aggregate data across multiple datasets; each of which might use different column headings depending on who compiled the tabular dataset. Furthermore, how can one be sure they are comparing things of equivalent type?

To remedy this, they want to use the definitions from the metadata file meta.xml. The easiest approach to achieve this is to modify their parser to export [json-ld] and transform the tabular data into RDF that can be easily reconciled.

The resultant "GeoJSON-LD" takes the form (edited for brevity):

Example 36
{
    "@context": {
        "base": "http://www.gbif.org/dataset/db6cd9d7-7be5-4cd0-8b3c-fb6dd7446472/",
        "Feature": "http://example.com/vocab#Feature",
        "Point": "http://example.com/vocab#Point",
        "modified": "http://purl.org/dc/terms/modified",
        "institutionCode": "http://rs.tdwg.org/dwc/terms/institutionCode",
        "collectionCode": "http://rs.tdwg.org/dwc/terms/collectionCode",
        "basisOfRecord": "http://rs.tdwg.org/dwc/terms/basisOfRecord"
    },
    "type": "Feature",
    "@type": "http://rs.tdwg.org/dwc/terms/Occurrence",
    "id": "OBSNEV:SINFONEVADA:SINFON-100-005717-20040930",
    "@id": "base:OBSNEV:SINFONEVADA:SINFON-100-005717-20040930",
    "properties": {
        "modified": "2013-06-20T11:18:18",
        "institutionCode": "OBSNEV",
        "collectionCode": "SINFONEVADA",
        "basisOfRecord": "HumanObservation"
    },
    "geometry": {
        "type": "Point",
        "coordinates": [-3.116135071, 37.12724018, 1992]
    }
}

The complete JSON object may be retrieved here.

The unique identifier for each "occurrence" record has been mapped to a URI by appending the local identifier (from column id) to the URI of the dataset within which the record occurs.

Requires: URIMapping, SemanticTypeDefinition and CsvToRdfTransformation.


The @type of the entity is taken from the rowType attribute within the metadata file.
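The mapping just described can be sketched in Python as a small function that adds @context, @type and @id to an existing GeoJSON feature. This is a rough illustration: the function and parameter names are hypothetical, the context keys are derived by taking the last path segment of each term URI (an assumption that happens to hold for the terms used here), and the Feature and Point entries of the context are omitted.

```python
def to_json_ld(feature, dataset_uri, row_type, terms):
    """Return a copy of a GeoJSON feature with JSON-LD keywords added.

    feature     -- dict with at least an "id" key (local identifier)
    dataset_uri -- URI of the dataset, used as the "base" prefix
    row_type    -- value of the rowType attribute in meta.xml
    terms       -- iterable of term URIs taken from meta.xml
    """
    context = {"base": dataset_uri}
    for term in terms:
        # Assumption: the short name is the last path segment of the term URI.
        context[term.rsplit("/", 1)[-1]] = term
    linked = dict(feature)
    linked["@context"] = context
    linked["@type"] = row_type
    linked["@id"] = "base:" + feature["id"]
    return linked


if __name__ == "__main__":
    feature = {
        "type": "Feature",
        "id": "OBSNEV:SINFONEVADA:SINFON-100-005717-20040930",
        "properties": {"modified": "2013-06-20T11:18:18"},
    }
    print(to_json_ld(
        feature,
        "http://www.gbif.org/dataset/db6cd9d7-7be5-4cd0-8b3c-fb6dd7446472/",
        "http://rs.tdwg.org/dwc/terms/Occurrence",
        ["http://purl.org/dc/terms/modified"],
    ))
```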


The amendment of the GeoJSON specification to include JSON-LD is a work in progress at the time of writing. Details can be found on the GeoJSON GitHub.


It is the hope of the DwC-A format specification authors that the availability of a general metadata vocabulary for describing CSV files, or indeed any tabular text datasets, will mean that DwC-A can be deprecated. This would allow the biodiversity community, and initiatives such as GBIF, to spend their efforts developing tools that support the generic standard rather than their own domain specific conventions and specifications, thus increasing the accessibility of biodiversity data.

To achieve this goal, it is essential that the key characteristics of the DwC-A format can be adequately described, thus enabling the general metadata vocabulary to be adopted without needing to modify the existing DwC-A encoded data holdings.

2.22 Use Case #22 - Making sense of other people's data

(Contributed by Steve Peters via Phil Archer with input from Ian Makgill)

spendnetwork.com harvests spending data from multiple UK local and central government CSV files. It adds new metadata and annotations to the data and cross-links suppliers to OpenCorporates and, elsewhere, is beginning to map transaction types to different categories of spending.

For example, East Sussex County Council publishes its spending data as Excel spreadsheets.

A snippet of data from East Sussex County Council indicating payments over £500 for the second financial quarter of 2011 is provided below. White space has been added for clarity. The full data file for that period (saved in CSV format from Microsoft Excel 2007) is provided here: ESCC-payment-data-Q2281011.csv

Example 37
Transparency Q2 - 01.07.11 to 30.09.11 as at 28.10.11,,,,,
                         Name,          Payment category,   Amount,                        Department,Document no.,Post code
               MARTELLO TAXIS,   Education HTS Transport,     £620,"Economy, Transport & Environment",  7000785623,     BN25
               MARTELLO TAXIS,   Education HTS Transport, "£1,425","Economy, Transport & Environment",  7000785624,     BN25
MCL TRANSPORT CONSULTANTS LTD,        Passenger Services, "£7,134","Economy, Transport & Environment",  4500528162,     BN25
MCL TRANSPORT CONSULTANTS LTD,Concessionary Fares Scheme,"£10,476","Economy, Transport & Environment",  4500529102,     BN25
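The extract above shows several things a reuser has to work out before the data can be analysed: the first line is a title rather than a header, and amounts are formatted strings with a currency symbol and thousands separators. The Python sketch below illustrates one way of handling this. It assumes the layout of this particular file (one title line followed by a header line, and no added white space, as in the original CSV); the function name is hypothetical.

```python
import csv
import io
from decimal import Decimal

SAMPLE = (
    "Transparency Q2 - 01.07.11 to 30.09.11 as at 28.10.11,,,,,\n"
    "Name,Payment category,Amount,Department,Document no.,Post code\n"
    'MARTELLO TAXIS,Education HTS Transport,£620,"Economy, Transport & Environment",7000785623,BN25\n'
    'MARTELLO TAXIS,Education HTS Transport,"£1,425","Economy, Transport & Environment",7000785624,BN25\n'
)


def read_payments(text):
    """Return (title, rows) with the Amount column parsed as Decimal."""
    lines = io.StringIO(text)
    # Assumption: exactly one title line precedes the header row.
    title = next(lines).rstrip("\n").rstrip(",")
    rows = []
    for row in csv.DictReader(lines):
        row["Amount"] = Decimal(row["Amount"].lstrip("£").replace(",", ""))
        rows.append(row)
    return title, rows


if __name__ == "__main__":
    title, rows = read_payments(SAMPLE)
    print(title)
    print(sum(row["Amount"] for row in rows))
```

None of these steps can be discovered from the CSV file alone, which is why metadata published by a third party can save other reusers the same effort.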

This data is augmented by spendnetwork.com and presented in a Web page. The web page for East Sussex County Council is illustrated in Fig. 5 Payments over £500 for East Sussex County Council July-Sept 2011, illustrated by spendnetwork.

Fig. 5 Payments over £500 for East Sussex County Council July-Sept 2011, illustrated by spendnetwork

Notice the Linked Data column that links to OpenCorporates data on MCL Transport Consultants Ltd. If we follow the 'more' link we see many more cells that spendnetwork would like to include (see Fig. 6 Payment transaction details, illustrated by spendnetwork). Where data is available from the original spreadsheet it has been included.

Fig. 6 Payment transaction details, illustrated by spendnetwork

The schema here is defined by a third party (spendnetwork.com) to make sense of the original data within their own model (only some of which is shown here; spendnetwork.com also tries to categorize transactions and more). This model exists independently of multiple source datasets and entails a mechanism for reusers to link to the original data from the metadata. Published metadata can be seen variously as feedback, advertising, enrichment or annotations. Such information could help the publisher to improve the quality of the original source; for the community at large, it reduces the need for repetition of the work done to make sense of the data and facilitates a network effect. It may also be the case that the metadata creator is better able to put the original data into a wider context with more accuracy and commitment than the original publisher.

Another (similar) scenario is LG-Inform. This harvests government statistics from multiple sources, many in CSV format, calculates rates, percentages, trends etc. and packages them as a set of performance metrics/measures. Again, it would be very useful for the original publisher to know, through metadata, that their source has been defined and used (potentially alongside someone else's data) in this way.

See http://standards.esd.org.uk/ and the "Metrics" tab therein; e.g. percentage of measured children in reception year classified as obese (3333).

The analysis of datasets undertaken by both spendnetwork.com and LG-Inform to make sense of other people's tabular data is time-consuming work. Making that metadata available is a potential help to the original data publisher as well as other would-be reusers of it.

Requires: WellFormedCsvCheck, IndependentMetadataPublication, ZeroEditAdditionOfSupplementaryMetadata, AnnotationAndSupplementaryInfo, AssociationOfCodeValuesWithExternalDefinitions, SemanticTypeDefinition, URIMapping and LinkFromMetadataToData.

2.23 Use Case #23 - Collating humanitarian information for crisis response

(Contributed by Tim Davies)

During a crisis response, information managers within the humanitarian community face a significant challenge in trying to collate data regarding humanitarian needs and response activities conducted by a large number of humanitarian actors. The schemas for these data sets are generally not standardized across different actors nor are the mechanisms for sharing the data. In the best case, this results in a significant delay between the collection of data and the formulation of that data into a common operational picture. In the worst case, information is simply not shared at all, leaving gaps in the understanding of the field situation.

The Humanitarian eXchange Language (HXL) project seeks to address this concern; enabling information from diverse parties to be collated into a single "Humanitarian Data Registry". Supporting tools are provided to assist participants in a given response initiative in finding information within this registry to meet their needs.

The HXL standard is designed to be a common publishing format for humanitarian data. A key design principle of the HXL project is that the data publishers are able to continue publication of their data using their existing systems. Unsurprisingly, data publishers often provide their data in tabular formats such as CSV, having exported the content from spreadsheet applications. As a result, the HXL standard is entirely based on tabular data.

During their engagement with the humanitarian response community, the HXL project team have identified two major concerns when working with tabular data:

To address these issues, the HXL project have developed a number of conventions for publishing tabular data in CSV format.

Column headings in the tabular data are supplemented with short hashtags that are defined in the HXL hashtag dictionary. The hashtag provides the normative meaning of the data in the column while the column header from the original data, a literal text string, is informative. This allows software systems to quickly ascertain the meaning of the data irrespective of the column heading and language used in the original data. For example, where a column provides information on the numbers of people affected by an emergency, the heading may be one of: "People affected", "Affected", "# de personnes concernées", "Afectadas/os" etc. The hashtag #affected is used to provide a common key to interpret the data.

Example 38
  Cluster,     District,  People affected,   People reached
  #sector,        #adm1,        #affected,         #reached
     WASH,        Coast,             9000,             9000
     WASH,    Mountains,             1000,              200
Education,        Coast,            15500,             8000
Education,    Mountains,              750,              600
   Health,        Coast,            20000,             3500
   Health,    Mountains,             3500,             1500

(whitespace included for clarity)

Requires: MultipleHeadingRows and SemanticTypeDefinition.
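A reader that treats the hashtag row, rather than the heading row, as the key for each column could be sketched in Python as follows. The sketch assumes that the hashtag row immediately follows a single heading row, as in the example above; the function name is hypothetical and this is not the HXL project's own tooling.

```python
import csv
import io

SAMPLE = """\
  Cluster,     District,  People affected,   People reached
  #sector,        #adm1,        #affected,         #reached
     WASH,        Coast,             9000,             9000
     WASH,    Mountains,             1000,              200
"""


def read_hxl(text):
    """Return (headings, records), keying each record by HXL hashtag."""
    reader = csv.reader(io.StringIO(text))
    headings = [cell.strip() for cell in next(reader)]  # informative only
    hashtags = [cell.strip() for cell in next(reader)]  # normative keys
    if not all(tag.startswith("#") for tag in hashtags):
        raise ValueError("second row does not look like an HXL hashtag row")
    records = [
        dict(zip(hashtags, (cell.strip() for cell in row))) for row in reader
    ]
    return headings, records


if __name__ == "__main__":
    headings, records = read_hxl(SAMPLE)
    print(sum(int(record["#affected"]) for record in records))
```

Because records are keyed by hashtag, the same code works unchanged when the heading row reads "Afectadas/os" instead of "People affected".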

Hashtags may be supplemented with attributes to refine the meaning of the data. A suggested set of attributes is provided in the HXL hashtag dictionary. For example, attributes may be used to specify the language used for the text in a given column using "+" followed by an ISO 639 language code:

Example 39
     Project title,             Titre du projet
      #activity+en,                #activity+fr
Malaria treatments,     Traitement du paludisme
  Teacher training,Formation des enseignant(e)s

(whitespace included for clarity)

Requires: MultilingualContent.
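Separating a hashtag from its attributes is straightforward; the Python sketch below shows one way to do it and to locate the columns carrying a given attribute, such as a language code. The function names are hypothetical, and the sketch assumes that "+" is used only as the attribute separator, as in the example above.

```python
def split_hashtag(column_tag):
    """Split '#activity+fr' into ('#activity', ['fr'])."""
    tag, *attributes = column_tag.strip().split("+")
    return tag, attributes


def columns_for(hashtags, tag, attribute):
    """Return the positions of columns with the given tag and attribute."""
    matches = []
    for position, column_tag in enumerate(hashtags):
        base, attributes = split_hashtag(column_tag)
        if base == tag and attribute in attributes:
            matches.append(position)
    return matches


if __name__ == "__main__":
    hashtags = ["#activity+en", "#activity+fr"]
    print(columns_for(hashtags, "#activity", "fr"))
```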

Where multiple data-values for a given field code are provided in a single row, the field code is repeated - as illustrated in the example below that provides geocodes for multiple locations pertaining to the subject of the record.

Example 40
P-code  1,P-code  2,P-code  3
   020503,         ,
   060107,   060108,
   173219,         ,
   530012,   530013,   530015
   279333,         ,

(whitespace included for clarity)

Requires: RepeatedProperties.

In the example above, we see an often repeated pattern where data includes codes to reference some authoritative term, definition or other resource; e.g. the location code 020503. In order to make sense of the data, these codes must be reconciled with their official definitions.

Requires: AssociationOfCodeValuesWithExternalDefinitions.

A snippet of an example of a tabular HXL data file is provided below. A local copy of the HXL data file is also available: HXL_3W_samples_draft_Multilingual.csv.

Example 41
Fecha del informe,      Fuente,     Implementador,Código de sector,       Sector / grupo,   Sector / group,    Subsector,     País,Código de provincia, Province,    Region,Código del municipio,Municipality
   #date+reported,#meta+source,              #org,    #sector+code,           #sector+es,       #sector+en,#subsector+en, #country,         #adm1+code, #adm1+en,#region+en,          #adm2+code,    #adm2+en
       2013-11-19,Mapaction OP,      World VISION,             S01,Refugio de emergencia,Emergency Shelter,             ,Filipinas,           60400000,    Aklan,        VI,                    ,
       2013-11-19,   DHNetwork,DFID Medical Teams,             S02,                Salud,           Health,             ,         ,           60400000,    Aklan,        VI,                    ,
       2013-11-19,   DHNetwork,               MSF,             S02,                Salud,           Health,             ,         ,           60400000,    Aklan,        VI,                    ,
       2013-11-19,  Cluster 3W,     LDS Charities,             S03,                 WASH,             WASH,      Hygiene,Filipinas,           60400000,    Aklan,        VI,                    ,

(whitespace included for clarity)

2.24 Use Case #24 - Expressing a hierarchy within occupational listings

(Contributed by Dan Brickley)

Our user intends to analyze the current state of the job market using information gleaned from job postings that are published using schema.org markup.


schema.org defines a schema for a listing that describes a job opening within an organization: JobPosting.

One of the things our user wants to do is to organise the job postings into categories based on the occupationalCategory property of each JobPosting.

The occupationalCategory property is used to categorize the described job. The O*NET-SOC Taxonomy is schema.org's recommended controlled vocabulary for the occupational categories.

The schema.org documentation notes that the value of the occupationalCategory property should include both the textual label and the formal code from the O*NET-SOC Taxonomy, as illustrated in the following RDFa snippet:

Example 42
<br><strong>Occupational Category:</strong> <span property="occupationalCategory">15-1199.03 Web Administrators</span>

The O*NET-SOC Taxonomy is republished every few years; the occupational listing for 2010 is the most recent version available. This listing is also available in CSV format. An extract from this file is provided below. A local copy of this CSV file is also available: file = 2010_Occupations.csv.

Example 43
O*NET-SOC 2010 Code,O*NET-SOC 2010 Title,O*NET-SOC 2010 Description
15-1199.00,"Computer Occupations, All Other",All computer occupations not listed separately.
15-1199.01,Software Quality Assurance Engineers and Testers,Develop and execute software test plans in order to identify software problems and their causes.
15-1199.02,Computer Systems Engineers/Architects,"Design and develop solutions to complex applications problems, system administration issues, or network concerns. Perform systems management and integration functions."
15-1199.03,Web Administrators,"Manage web environment design, deployment, development and maintenance activities. Perform testing and quality assurance of web sites and web applications."
15-1199.04,Geospatial Information Scientists and Technologists,"Research or develop geospatial technologies. May produce databases, perform applications programming, or coordinate projects. May specialize in areas such as agriculture, mining, health care, retail trade, urban planning, or military intelligence."
15-1199.05,Geographic Information Systems Technicians,"Assist scientists, technologists, or related professionals in building, maintaining, modifying, or using geographic information systems (GIS) databases. May also perform some custom application development or provide user support."
15-1199.06,Database Architects,"Design strategies for enterprise database systems and set standards for operations, programming, and security. Design and construct large relational databases. Integrate new systems with existing warehouse structure and refine system performance and functionality."
15-1199.07,Data Warehousing Specialists,"Design, model, or implement corporate data warehousing activities. Program and configure warehouses of database information and provide support to warehouse users."
15-1199.08,Business Intelligence Analysts,Produce financial and market intelligence by querying data repositories and generating periodic reports. Devise methods for identifying data patterns and trends in available information sources.
15-1199.09,Information Technology Project Managers,"Plan, initiate, and manage information technology (IT) projects. Lead and guide the work of technical staff. Serve as liaison between business and technical aspects of projects. Plan project stages and assess business implications for each stage. Monitor progress to assure deadlines, standards, and cost targets are met."
15-1199.10,Search Marketing Strategists,"Employ search marketing tactics to increase visibility and engagement with content, products, or services in Internet-enabled devices or interfaces. Examine search query behaviors on general or specialty search engines or other Internet-based content. Analyze research, data, or technology to understand user intent and measure outcomes for ongoing optimization."
15-1199.11,Video Game Designers,"Design core features of video games. Specify innovative game and role-play mechanics, story lines, and character biographies. Create and maintain design documentation. Guide and collaborate with production staff to produce games as designed."
15-1199.12,Document Management Specialists,"Implement and administer enterprise-wide document management systems and related procedures that allow organizations to capture, store, retrieve, share, and destroy electronic records and documents."

The CSV file follows the specification outlined in [RFC4180], including the use of pairs of double quotes ("") to enclose cells that themselves contain commas.

Also note that each row provides a unique identifier for the occupation it describes, given in the O*NET-SOC 2010 Code column. Because this code is unique for every row, it can be considered as the primary key for each row in the listing.

Requires: PrimaryKey.
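As an illustration of both points, the sketch below reads a short extract of the listing with Python's standard csv module, which handles the quoted cells, and indexes the rows by the code column while rejecting duplicates. The function name index_by_key is hypothetical; this is a sketch of the idea, not a definition of how primary keys are to be checked.

```python
import csv
import io

# Short extract from the O*NET-SOC listing above.
SAMPLE = '''O*NET-SOC 2010 Code,O*NET-SOC 2010 Title,O*NET-SOC 2010 Description
15-1199.00,"Computer Occupations, All Other",All computer occupations not listed separately.
15-1199.03,Web Administrators,"Manage web environment design, deployment, development and maintenance activities. Perform testing and quality assurance of web sites and web applications."
'''

def index_by_key(text, key_column):
    """Index rows by key_column, raising ValueError on a duplicate key."""
    index = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = row[key_column]
        if key in index:
            raise ValueError(f"duplicate primary key: {key}")
        index[key] = row
    return index

occupations = index_by_key(SAMPLE, "O*NET-SOC 2010 Code")
```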

Closer inspection of the O*NET-SOC 2010 code illustrates the hierarchical classification within the taxonomy. The first six digits are based on the Standard Occupational Classification (SOC) code from the US Bureau of Labor Statistics, with further subcategorization thereafter where necessary. The first and second digits represent the major group; the third digit represents the minor group; the fourth and fifth digits represent the broad occupation; and the sixth digit represents the detailed occupation.

The SOC structure (2010) is available in Microsoft Excel 97-2003 Workbook format. An extract of this structure, in CSV format (exported from Microsoft Excel 2007), is provided below. A local copy of the SOC structure in CSV is also available: file = soc_structure_2010.csv.

Example 44
Bureau of Labor Statistics,,,,,,,,,
On behalf of the Standard Occupational Classification Policy Committee (SOCPC),,,,,,,,,
January 2009,,,,,,,,,
*** This is the final structure for the 2010 SOC.   Questions should be emailed to soc@bls.gov***,,,,,,,,,
,2010 Standard Occupational Classification,,,,,,,,
Major Group,Minor Group,Broad Group,Detailed Occupation,,,,,,
11-0000,,,,Management Occupations,,,,,
,11-1000,,,Top Executives,,,,,
,,11-1010,,Chief Executives,,,,,
,,,11-1011,Chief Executives,,,,,
,,,13-2099,"Financial Specialists, All Other",,,,,
15-0000,,,,Computer and Mathematical Occupations,,,,,
,15-1100,,,Computer Occupations,,,,,
,,15-1110,,Computer and Information Research Scientists,,,,,
,,,15-1111,Computer and Information Research Scientists,,,,,
,,15-1120,,Computer and Information Analysts,,,,,
,,,15-1121,Computer Systems Analysts,,,,,
,,,15-1122,Information Security Analysts,,,,,
,,15-1130,,Software Developers and Programmers,,,,,
,,,15-1131,Computer Programmers,,,,,
,,,15-1132,"Software Developers, Applications",,,,,
,,,15-1133,"Software Developers, Systems Software",,,,,
,,,15-1134,Web Developers,,,,,
,,15-1140,,Database and Systems Administrators and Network Architects,,,,,
,,,15-1141,Database Administrators,,,,,
,,,15-1142,Network and Computer Systems Administrators,,,,,
,,,15-1143,Computer Network Architects,,,,,
,,15-1150,,Computer Support Specialists,,,,,
,,,15-1151,Computer User Support Specialists,,,,,
,,,15-1152,Computer Network Support Specialists,,,,,
,,15-1190,,Miscellaneous Computer Occupations,,,,,
,,,15-1199,"Computer Occupations, All Other",,,,,
,15-2000,,,Mathematical Science Occupations,,,,,

The header line here comes below an empty row and is separated from the data by another empty row. There is metadata about the table in the rows above the header line.

Requires: MultipleHeadingRows and AnnotationAndSupplementaryInfo.

Being familiar with SKOS, our user decides to map both the O*NET-SOC and SOC taxonomies into a single hierarchy expressed using RDF/OWL and the SKOS vocabulary.

Note that in order to express the two taxonomies in SKOS, the local identifiers used in the CSV files (e.g. 15-1199.03) must be mapped to URIs.

Requires: URIMapping.

Each of the five levels used across the occupation classification schemes is assigned to a particular OWL class, each of which is a sub-class of skos:Concept:

  • ex:SOC-MajorGroup
  • ex:SOC-MinorGroup
  • ex:SOC-BroadGroup
  • ex:SOC-DetailedOccupation
  • ex:ONETSOC-Occupation

The SOC taxonomy contains four different types of entities, and so requires several different passes to extract each of those from the CSV file. Depending on which kind of entity is being extracted, a different column provides the unique identifier for the entity. Data in a given row is only processed if the value for the cell designated as the unique identifier is not blank. For example, if the Detailed Occupation column is designated as providing the unique identifier (e.g. to extract entities of type ex:SOC-DetailedOccupation), then the only rows to be processed in the snippet below would be "Financial Specialists, All Other" (13-2099), "Computer and Information Research Scientists" (15-1111) and "Computer Occupations, All Other" (15-1199). All other rows would be ignored.

Example 45
Major Group,Minor Group,Broad Group,Detailed Occupation,                                            ,,,,,
           ,           ,           ,                   ,                                            ,,,,,
           ,           ,           ,            13-2099,          "Financial Specialists, All Other",,,,,
    15-0000,           ,           ,                   ,       Computer and Mathematical Occupations,,,,,
           ,    15-1100,           ,                   ,                        Computer Occupations,,,,,
           ,           ,    15-1110,                   ,Computer and Information Research Scientists,,,,,
           ,           ,           ,            15-1111,Computer and Information Research Scientists,,,,,
           ,           ,    15-1190,                   ,          Miscellaneous Computer Occupations,,,,,
           ,           ,           ,            15-1199,           "Computer Occupations, All Other",,,,,
           ,    15-2000,           ,                   ,            Mathematical Science Occupations,,,,,

(whitespace added for clarity)

Requires: ConditionalProcessingBasedOnCellValues.
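The pass described above, in which a row is processed only when its designated identifier cell is not blank, might be sketched in Python as follows. The function name extract_entities and the assumption that the label always sits in the fifth, untitled column are illustrative only; the padding whitespace has been removed from the sample.

```python
import csv
import io

# Snippet from the example above, without the padding whitespace.
SAMPLE = """Major Group,Minor Group,Broad Group,Detailed Occupation,,,,,,
,,,,,,,,,
,,,13-2099,"Financial Specialists, All Other",,,,,
15-0000,,,,Computer and Mathematical Occupations,,,,,
,15-1100,,,Computer Occupations,,,,,
,,15-1110,,Computer and Information Research Scientists,,,,,
,,,15-1111,Computer and Information Research Scientists,,,,,
,,15-1190,,Miscellaneous Computer Occupations,,,,,
,,,15-1199,"Computer Occupations, All Other",,,,,
,15-2000,,,Mathematical Science Occupations,,,,,
"""

LABEL_INDEX = 4  # assumption: the label is in the fifth, untitled column

def extract_entities(text, id_column):
    """Return (identifier, label) pairs for rows whose id_column cell is not blank."""
    rows = csv.reader(io.StringIO(text))
    header = next(rows)
    id_index = header.index(id_column)
    return [
        (row[id_index].strip(), row[LABEL_INDEX].strip())
        for row in rows
        if row[id_index].strip()
    ]

detailed = extract_entities(SAMPLE, "Detailed Occupation")
```

Running the same function with "Major Group", "Minor Group" or "Broad Group" as the identifier column gives the other three passes.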

The hierarchy in the SOC structure is implied by inheritance from the preceding row(s). For example, the row describing SOC minor group "Computer Occupations" (Minor Group = 15-1100, above) has an empty cell value for column Major Group. The value for SOC major group is provided by the preceding row. In the case of SOC detailed occupation "Computer Occupations, All Other" (Detailed Occupation = 15-1199), the value for column Major Group is provided 20 lines previously, when a value in that column was most recently provided. The example snippet below illustrates what the CSV would look like if the inherited cell values were present:

Example 46
Major Group,Minor Group,Broad Group,Detailed Occupation,                                            ,,,,,
           ,           ,           ,                   ,                                            ,,,,,
    13-0000,    13-2000,    13-2090,            13-2099,          "Financial Specialists, All Other",,,,,
    15-0000,           ,           ,                   ,       Computer and Mathematical Occupations,,,,,
    15-0000,    15-1100,           ,                   ,                        Computer Occupations,,,,,
    15-0000,    15-1100,    15-1110,                   ,Computer and Information Research Scientists,,,,,
    15-0000,    15-1100,    15-1110,            15-1111,Computer and Information Research Scientists,,,,,
    15-0000,    15-1100,    15-1190,                   ,          Miscellaneous Computer Occupations,,,,,
    15-0000,    15-1100,    15-1190,            15-1199,           "Computer Occupations, All Other",,,,,
    15-0000,    15-2000,           ,                   ,            Mathematical Science Occupations,,,,,

(whitespace added for clarity)

It is difficult to programmatically describe how the inherited values should be implemented. It is not as simple as inferring the value for a blank cell from the most recent preceding row in which a non-blank value was provided for that column. For example, the last row in the example above describing "Mathematical Science Occupations" does not inherit the values from columns Broad Group and Detailed Occupation in the preceding row because it describes a new level in the hierarchy.

However, given that the SOC code is a string value with regular structure that reflects the position of a given concept within the hierarchy, it is possible to determine the identifier of each of the broader concepts by parsing the identifier string. For example, the regular expression /^(\d{2})-(\d{2})(\d)\d$/ could be used to split the identifier for a detailed occupation code into its constituent parts from which the identifiers for the associated broader concepts could be constructed.

Requires: CellMicrosyntax.
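A minimal Python sketch of this approach is given below. It uses the regular expression quoted above; the zero-padding used to rebuild the broader identifiers is an assumption inferred from the codes in the examples, and the function name broader_concepts is hypothetical.

```python
import re

# Regular expression quoted in the text above.
SOC_CODE = re.compile(r"^(\d{2})-(\d{2})(\d)\d$")

def broader_concepts(code):
    """Return the major, minor and broad group identifiers for a detailed SOC code."""
    match = SOC_CODE.match(code)
    if match is None:
        raise ValueError(f"not a detailed SOC code: {code}")
    first, second, third = match.groups()
    return [
        f"{first}-0000",             # major group
        f"{first}-{second}00",       # minor group
        f"{first}-{second}{third}0", # broad group
    ]

ancestors = broader_concepts("15-1199")
```

For the detailed occupations shown in the examples above, the result matches the inherited cell values, e.g. 15-0000, 15-1100 and 15-1190 for 15-1199.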

The same kind of processing applies to the O*NET-SOC taxonomy; in this case also extracting a description for the occupation. There is also an additional complication: where an O*NET-SOC code ends in ".00", that occupation is a direct mapping to the occupation defined in the SOC taxonomy. For example, the O*NET-SOC code 15-1199.00 refers to the same occupation category as the SOC code 15-1199: "Computer Occupations, All Other".

To implement this complication, we need to use conditional processing.

If the final two digits of the O*NET-SOC code are "00", then:


The example below illustrates the conditional behaviour:

Example 47

15-1199.00,"Computer Occupations, All Other",All computer occupations not listed separately.

resulting RDF (in Turtle syntax):

ex:15-1199 a ex:SOC-DetailedOccupation ;
    skos:notation "15-1199" ;
    skos:prefLabel "Computer Occupations, All Other" ;
    dct:description "All computer occupations not listed separately." .


15-1199.03,Web Administrators,"Manage web environment design, deployment, development and maintenance activities. Perform testing and quality assurance of web sites and web applications."

resulting RDF (in Turtle syntax):

ex:15-1199.03 a ex:ONETSOC-Occupation ;
    skos:notation "15-1199.03" ;
    skos:prefLabel "Web Administrators" ;
    dct:description "Manage web environment design, deployment, development and maintenance activities. Perform testing and quality assurance of web sites and web applications." ;
    skos:broader ex:15-1199 .

Requires: ConditionalProcessingBasedOnCellValues.
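The branching shown in the example can be sketched in Python as below. The rule is inferred from the two outputs above rather than taken from a stated algorithm, and the function name and the dictionary keys "id" and "type" are hypothetical stand-ins for the RDF subject and rdf:type.

```python
def onet_row_to_concept(code, title, description):
    """Map one O*NET-SOC row to a dictionary of concept properties."""
    soc_code, _, suffix = code.partition(".")
    concept = {
        "skos:prefLabel": title,
        "dct:description": description,
    }
    if suffix == "00":
        # Direct mapping: describe the SOC detailed occupation itself.
        concept["id"] = f"ex:{soc_code}"
        concept["type"] = "ex:SOC-DetailedOccupation"
        concept["skos:notation"] = soc_code
    else:
        # O*NET-SOC specialisation: link to the SOC detailed occupation.
        concept["id"] = f"ex:{code}"
        concept["type"] = "ex:ONETSOC-Occupation"
        concept["skos:notation"] = code
        concept["skos:broader"] = f"ex:{soc_code}"
    return concept

all_other = onet_row_to_concept(
    "15-1199.00",
    "Computer Occupations, All Other",
    "All computer occupations not listed separately.",
)
web_admin = onet_row_to_concept("15-1199.03", "Web Administrators", "Manage web environment design.")
```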

A snippet of the final SKOS concept scheme, expressed in RDF using Turtle [turtle] syntax, resulting from transformation of the O*NET-SOC and SOC taxonomies into RDF is provided below. Ideally, all duplicate triples will be removed, such as the skos:prefLabel property for concept ex:15-1199, which would be provided by both the O*NET-SOC and SOC CSV files.

Example 48
ex:15-0000 a ex:SOC-MajorGroup ;
    skos:notation "15-0000" ;
    skos:prefLabel "Computer and Mathematical Occupations" .
ex:15-1100 a ex:SOC-MinorGroup ;
    skos:notation "15-1100" ;
    skos:prefLabel "Computer Occupations" ;
    skos:broader ex:15-0000 .
ex:15-1190 a ex:SOC-BroadGroup ;
    skos:notation "15-1190" ;
    skos:prefLabel "Miscellaneous Computer Occupations" ;
    skos:broader ex:15-0000, ex:15-1100 .
ex:15-1199 a ex:SOC-DetailedOccupation ;
    skos:notation "15-1199" ;
    skos:prefLabel "Computer Occupations, All Other" ;
    dct:description "All computer occupations not listed separately." ;
    skos:broader ex:15-0000, ex:15-1100, ex:15-1190 .
ex:15-1199.03 a ex:ONETSOC-Occupation ;
    skos:notation "15-1199.03" ;
    skos:prefLabel "Web Administrators" ;
    dct:description "Manage web environment design, deployment, development and maintenance activities. Perform testing and quality assurance of web sites and web applications." ;
    skos:broader ex:15-0000, ex:15-1100, ex:15-1190, ex:15-1199 .

Once the SKOS concept scheme has been defined, it is possible for our user to group job postings by SOC Major Group, SOC Minor Group, SOC Broad Group, SOC Detailed Occupation and O*NET-SOC Occupation to provide summary statistics about the job market.

For example, we can use the SKOS concept scheme to group job postings for "Web Administrators" (code 15-1199.03) as follows:

2.25 Use Case #25 - Consistent publication of local authority data

Open data and transparency are foundational elements within the UK Government's approach to improve public service. The Local Government Association (LGA) promotes open and transparent local government to meet local needs and demands; to innovate and transform services leading to improvements and efficiencies; to drive local economic growth; and to empower citizen and community groups to choose or run services and shape neighbourhoods.

As part of this initiative, the LGA is working to put local authority data into the public realm in ways that provide real benefits to citizens, business, councils and the wider data community. The LGA provides a web portal to help identify open data published by UK local authorities and encourage standardisation of local open data; enabling data consumers to browse through datasets published by local authorities across the UK and providing guidance and tools to data publishers to drive consistent practice in publication.

Data is typically published in CSV format.

An illustrative example is provided for data describing public toilets. The portal lists datasets of information about public toilets provided by more than 70 local authorities. In order to ensure consistent publication of data about public toilets the LGA provides both guidance documentation and a machine-readable schema against which datasets may be validated using on-line tools.

The public toilets CSV schema has 32 (mandated or optional) fields. The validator tool allows columns to appear in any order, matching the column order to the schema based on the title in the column header. Furthermore, CSV files containing additional columns, such as SecureDisposalofSharps specified within the public toilet dataset for Bath and North East Somerset (as shown below), are also considered valid. Additional columns are included where one or more local authorities have specific requirements to include additional information to satisfy local needs. Such additional columns are not supported using formal 'extensions' of the schema as the organisational and administrative burden of doing so was considered too great.

Example 49
15/09/2014,http://opendatacommunities.org/id/unitary-authority/bath-and-north-east-somerset,Bath and North East Somerset,http://id.esd.org.uk/service/579,Public Toilets,CHARLOTTE STREET ENTRANCE,CHARLOTTE STREET,KINGSMEAD,BATH,BA1 2NE,http://statistics.data.gov.uk/id/statistical-geography/E05001949,Kingsmead,10001147066,OSGB36,374661,165006,http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/,Female and male,Female and male,TRUE,TRUE,24 Hours ,BANES COUNCIL AND HEALTHMATIC,0.2,
15/09/2014,http://opendatacommunities.org/id/unitary-authority/bath-and-north-east-somerset,Bath and North East Somerset,http://id.esd.org.uk/service/579,Public Toilets,ALICE PARK,GLOUCESTER ROAD,LAMBRIDGE,BATH,BA1 7BL,http://statistics.data.gov.uk/id/statistical-geography/E05001950,Lambridge,10001146447,OSGB36,376350,166593,http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/,Female and male,Female and male,TRUE,TRUE,06:00-21:00,BANES COUNCIL AND HEALTHMATIC,0.2,
15/09/2014,http://opendatacommunities.org/id/unitary-authority/bath-and-north-east-somerset,Bath and North East Somerset,http://id.esd.org.uk/service/579,Public Toilets,HENRIETTA PARK,HENRIETTA ROAD,ABBEY,BATH,BA2 6LU,http://statistics.data.gov.uk/id/statistical-geography/E05001935,Abbey,10001147120,OSGB36,375338,165170,http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/,Female and male,Female and male,FALSE,Female and male,Winter & Su 10:00-16:00 | Other times: 08:00-18:00,BANES COUNCIL AND HEALTHMATIC,0,Scheduled for improvement Autumn 2014
15/09/2014,http://opendatacommunities.org/id/unitary-authority/bath-and-north-east-somerset,Bath and North East Somerset,http://id.esd.org.uk/service/579,Public Toilets,SHAFTESBURY ROAD,SHAFTESBURY ROAD,OLDFIELD ,BATH,BA2 3LH,http://statistics.data.gov.uk/id/statistical-geography/E05001958,Oldfield,10001147060,OSGB36,373809,164268,http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/,Female and male,Female and male,TRUE,TRUE,24 Hours ,BANES COUNCIL AND HEALTHMATIC,0.2,

A local copy of this dataset is included for convenience.

Requires: WellFormedCsvCheck, CsvValidation and SyntacticTypeDefinition.

3. Requirements

3.1 Accepted requirements

3.1.1 CSV parsing requirements

Ability to parse tabular data with cell delimiters other than comma (,)

Tabular data is often provided with cell delimiters other than comma (,). Fixed width formatting is also commonly used.

If a non-standard cell delimiter is used, it shall be possible to inform the CSV parser about the cell delimiter or fixed-width formatting.

Motivation: DisplayingLocationsOfCareHomesOnAMap, SurfaceTemperatureDatabank, SupportingSemantic-basedRecommendations, PublicationOfBiodiversityInformation and PlatformIntegrationUsingSTDF.


Standardizing the parsing of CSV is outside the chartered scope of the Working Group. However, [tabular-data-model] section 8. Parsing Tabular Data provides non-normative hints to creators of parsers to help them handle the wide variety of CSV-based formats that they may encounter due to the current lack of standardization of the format.

An annotated table may use the delimiter annotation, specified as part of a dialect description, to declare a string that is used to delimit cells in a given row. The default value is ",". See [tabular-metadata] section 5.9 Dialect Descriptions for further details.
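As a rough illustration of informing a parser about the delimiter, the Python sketch below passes a delimiter taken from a dictionary to the standard csv module. The dictionary only mimics a dialect description and the sample data is made up; note also that Python's csv module accepts a single-character delimiter only, which is a limitation of this sketch rather than of the annotation.

```python
import csv
import io

# Hypothetical stand-in for a dialect description.
dialect = {"delimiter": ";"}

# Made-up semicolon-delimited data.
SAMPLE = "id;name\n1;Bath\n2;Aklan\n"

def parse_with_dialect(text, dialect):
    """Parse text using the delimiter named in dialect, defaulting to a comma."""
    delimiter = dialect.get("delimiter", ",")
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

rows = parse_with_dialect(SAMPLE, dialect)
```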

Ability to identify comment lines within a CSV file and skip over them during parsing, format conversion or other processing

A tabular datafile may include comment lines. It shall be possible to declare how to recognize a comment line within the data (e.g. by specifying a sequence of characters that are found at the beginning of every comment line).

Comment lines shall not be treated as data when parsing, converting or processing the CSV file. During format conversion, the application may try to include the comment in the conversion.

Motivation: PlatformIntegrationUsingSTDF.


Standardizing the parsing of CSV is outside the chartered scope of the Working Group. However, [tabular-data-model] section 8. Parsing Tabular Data provides non-normative hints to creators of parsers to help them handle the wide variety of CSV-based formats that they may encounter due to the current lack of standardization of the format.

An annotated table may use the comment prefix annotation, specified as part of a dialect description, to declare a string that, when appearing at the beginning of a row, indicates that the row is a comment that should be associated as an rdfs:comment annotation to the table. The default value is "#". See [tabular-metadata] section 5.9 Dialect Descriptions for further details.
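The separation of comment lines from data can be sketched in Python as follows. The function name split_comments and the sample data are made up for illustration; the sketch keeps the comments so that a conversion could carry them along, and it is not the parsing procedure of [tabular-data-model].

```python
import csv

def split_comments(lines, comment_prefix="#"):
    """Separate comment lines (prefix removed) from data lines."""
    comments, data = [], []
    for line in lines:
        if line.startswith(comment_prefix):
            comments.append(line[len(comment_prefix):].strip())
        else:
            data.append(line)
    return comments, data

# Made-up sample with two comment lines.
SAMPLE = [
    "# measurements taken at site A",
    "id,value",
    "1,20.5",
    "# sensor recalibrated",
    "2,21.0",
]

comments, data = split_comments(SAMPLE)
rows = list(csv.reader(data))  # only the data lines reach the CSV parser
```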

3.1.2 Applications requirements

Ability to validate a CSV for conformance with a specified metadata definition

The content of a CSV often needs to be validated for conformance against a specification. A specification may be expressed in machine-readable format as defined in the Metadata Vocabulary for Tabular Data [tabular-metadata].

Validation shall assess conformance against structural definitions such as number of columns and the datatype for a given column. Further validation needs are to be determined. It is anticipated that validation may vary based on row-specific attributes such as the type of entity described in that row.

Dependency: R-WellFormedCsvCheck

Motivation: DigitalPreservationOfGovernmentRecords, OrganogramData, ChemicalImaging, ChemicalStructures, DisplayingLocationsOfCareHomesOnAMap, NetCdFcDl, PaloAltoTreeData and ConsistentPublicationOfLocalAuthorityData.


Validation of tabular data, as specified in [tabular-data-model] section 6.6 Validating Tables, includes the following aspects:

  • assessing compatibility of the table with associated metadata - checking the correct number of non-virtual columns and matching names/titles for columns where these are specified in a header row;
  • ensuring uniqueness of primary keys;
  • checking that all foreign keys are valid; and
  • cell validation.

As described in [tabular-data-model] section 4.6 Datatypes, cell validation includes assessment of the literal content of the cell (e.g. length of string or number of bytes) and of the value inferred from parsing that literal content (e.g. formatting and numerical constraints).
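To make some of these aspects concrete, the Python sketch below checks column titles, cell counts, primary key uniqueness and one simple datatype. It is a simplified illustration under stated assumptions, not the validation procedure defined in [tabular-data-model]: foreign keys are omitted, the function signature is hypothetical, and the sample data is made up.

```python
import csv
import io

def validate(text, titles, primary_key, integer_columns=()):
    """Return a list of human-readable validation errors (empty when valid)."""
    errors = []
    rows = csv.reader(io.StringIO(text))
    header = next(rows)
    if header != titles:
        return [f"header {header} does not match expected titles {titles}"]
    key_index = titles.index(primary_key)
    seen_keys = set()
    for number, row in enumerate(rows, start=2):  # the header is row 1
        if len(row) != len(titles):
            errors.append(f"row {number}: expected {len(titles)} cells, found {len(row)}")
            continue
        key = row[key_index]
        if key in seen_keys:
            errors.append(f"row {number}: duplicate primary key {key!r}")
        seen_keys.add(key)
        for title in integer_columns:
            cell = row[titles.index(title)]
            try:
                int(cell)
            except ValueError:
                errors.append(f"row {number}: {title} value {cell!r} is not an integer")
    return errors

# Made-up sample with a repeated key and a malformed integer.
SAMPLE = "id,name,count\n123,Buxbaumia piperi,2\n123,Cryptantha gypsophila,twelve\n"
errors = validate(SAMPLE, ["id", "name", "count"], "id", integer_columns=["count"])
```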

Ability to determine that a CSV should be rendered using RTL column ordering and RTL text direction in cells.

It shall be possible to declare whether a given tabular data file should be rendered with column order direction Right-to-Left (RTL); e.g. the first column on the far right, with subsequent columns displayed to the left of the preceding column. It shall also be possible to declare that the content of cells in particular columns is rendered RTL.

An "RTL aware" application should use the RTL declaration to determine how to display a given data file. Automatic detection of appropriate rendering shall be the default behaviour (in absence of any such declaration).


The directionality of the content does not affect the logical structure of the tabular data; i.e. the cell at index zero is followed by the cell at index 1, and then index 2 etc. As a result, parsing of RTL tabular data is anticipated to be identical to LTR content.

Motivation: SupportingRightToLeftDirectionality.


It is possible to set the column direction using the tableDirection property and the text direction on columns using the textDirection property, as defined in [tabular-metadata].

Ability to transform a CSV into RDF

Standardised CSV to RDF transformation mechanisms mitigate the need for bespoke transformation software to be developed by CSV data consumers, thus simplifying the exploitation of CSV data. Local identifiers for the entity described in a given row or used to reference some other entity need to be converted to URIs. RDF properties (or property paths) need to be determined to relate the entity described within a given row to the corresponding data values for that row. Where available, the type of a data value should be incorporated in the resulting RDF. Built-in types defined in RDF 1.1 [rdf11-concepts] (e.g. xsd:dateTime, xsd:integer etc.) and types defined in other RDF vocabularies / OWL ontologies (e.g. geo:wktLiteral, GeoSPARQL [geosparql] section 8.5.1 RDFS Datatypes refers) shall be supported.

Dependency: R-SemanticTypeDefinition, R-SyntacticTypeDefinition and R-URIMapping.

Motivation: DigitalPreservationOfGovernmentRecords, OrganogramData, PublicationOfPropertyTransactionData, RepresentingEntitiesAndFactsExtractedFromText, CanonicalMappingOfCSV, PublicationOfBiodiversityInformation and ExpressingHierarchyWithinOccupationalListings.


[csv2rdf] specifies the transformation of an annotated table to RDF; providing both minimal mode, where RDF output includes triples derived from the data within the annotated table, and standard mode, where RDF output additionally includes triples describing the structure of the annotated table.

Built-in datatypes are limited to those defined in [tabular-data-model] section 4.6 Datatypes. geo:wktLiteral and other datatypes from [geosparql] are not supported natively.

Ability to transform a CSV into JSON

Standardised CSV to JSON transformation mechanisms mitigate the need for bespoke transformation software to be developed by CSV data consumers, thus simplifying the exploitation of CSV data.

Motivation: DisplayingLocationsOfCareHomesOnAMap, IntelligentlyPreviewingCSVFiles, CanonicalMappingOfCSV and PublicationOfBiodiversityInformation.


[csv2json] specifies the transformation of an annotated table to JSON; providing both minimal mode, where JSON output includes objects derived from the data within the annotated table, and standard mode, where JSON output additionally includes objects describing the structure of the annotated table. In both modes, the transformation provides 'prettyfication' of the JSON output where objects are nested rather than forming a flat list of objects with relations.

Built-in datatypes from the annotated table, as defined in [tabular-data-model] section 4.6 Datatypes, are mapped to JSON primitive types.

Ability to transform CSV conforming to the core tabular data model yet lacking further annotation into an object / object graph serialisation

A CSV conforming with the core tabular data model [tabular-data-model], yet lacking any annotation that defines rich semantics for that data, shall be able to be transformed into an object / object graph serialisation such as JSON, XML or RDF using systematic rules - a "canonical" mapping.

The canonical mapping should provide automatic scoping of local identifiers (e.g. conversion to URI), identification of primary keys and detection of data types.

Motivation: CanonicalMappingOfCSV.


An annotated table is always generated by applications implementing this specification when processing tabular data; albeit that without supplementary metadata, those annotations are limited (e.g. the titles annotation may be populated from the column headings provided within the tabular data file). Transformations to both RDF and JSON operate on the annotated table; therefore, a canonical transformation is achieved by transforming an annotated table that has not been informed by supplementary metadata.

Ability to publish metadata independently from the tabular data resource it describes

Commonly, tabular datasets are published without the supplementary metadata that enables a third party to correctly interpret the published information. An independent party - in addition to the data publisher - shall be able to publish metadata about such a dataset, thus enabling a community of users to benefit from the efforts of that third party to understand that dataset.

Dependency: R-LinkFromMetadataToData and R-ZeroEditAdditionOfSupplementaryMetadata.

Motivation: MakingSenseOfOtherPeoplesData and PublicationOfBiodiversityInformation.


[tabular-metadata] specifies the format and structure of a metadata file that may be used to provide supplementary annotations on an annotated table or group of tables.

Ability to define a property-value pair for inclusion in each row

When annotating tabular data, it should be possible for one to define within the metadata a property-value pair that is repeated for every row in the tabular dataset; for example, the location ID for a set of weather observations, or the dataset ID for a set of biodiversity observations.

In the case of sparsely populated data, this property-value pair must be applied as a default only where that property is absent from the data.

As an illustration, the Darwin Core Archive standard provides the ability to specify such a property-value pair within its metadata description file meta.xml.

Example 50

123,"Cryptantha gypsophila Reveal & C.R. Broome",12
124,"Buxbaumia piperi",2


<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/xsd/simpledarwincore/SimpleDarwinRecord">
    <field index="0" term="http://rs.tdwg.org/dwc/terms/catalogNumber" />
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName" />
    <field index="2" term="http://rs.tdwg.org/dwc/terms/individualCount" />
    <field term="http://rs.tdwg.org/dwc/terms/datasetID" default="urn:lsid:tim.lsid.tdwg.org:collections:1"/>
  </core>
</archive>

Thus the original tabular data file specimens.csv is interpreted as:

Example 51

123,"Cryptantha gypsophila Reveal & C.R. Broome",12,urn:lsid:tim.lsid.tdwg.org:collections:1
124,"Buxbaumia piperi",2,urn:lsid:tim.lsid.tdwg.org:collections:1

Motivation: PublicationOfBiodiversityInformation.


To meet this requirement a virtual column, as specified in [tabular-data-model], must be specified for the additional property-value pair that is to be included in each row. The default annotation may be used to specify a string value that is used for every empty cell in the associated column. Alternatively, the value URL annotation provides an absolute URL for a given cell. [tabular-metadata] specifies how a URI Template, specified in [RFC6570], may be used to specify the value URL using the valueURL property.
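
The effect of such a default can be sketched in Python as follows. This is only an illustration under stated assumptions, not the processing defined in [tabular-data-model]: the helper apply_default is hypothetical, and the column names are taken from the Darwin Core terms in the example above because the sample data carries no header row.

```python
import csv
import io

DATASET_ID = "urn:lsid:tim.lsid.tdwg.org:collections:1"

def apply_default(rows, prop, default):
    """Add `prop` to each row, keeping any value the data already supplies."""
    result = []
    for row in rows:
        merged = dict(row)
        if not merged.get(prop):  # absent or empty: fall back to the default
            merged[prop] = default
        result.append(merged)
    return result

specimens = (
    '123,"Cryptantha gypsophila Reveal & C.R. Broome",12\n'
    '124,"Buxbaumia piperi",2\n'
)
# Assumed column names (the data has no header line of its own).
columns = ["catalogNumber", "scientificName", "individualCount"]
rows = list(csv.DictReader(io.StringIO(specimens), fieldnames=columns))
with_dataset = apply_default(rows, "datasetID", DATASET_ID)
```

Because the default is applied only where the property is missing or empty, a sparsely populated column keeps whatever values the publisher did provide.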

Ability to add supplementary metadata to an existing CSV file without requiring modification of that file

It may not be possible for a tabular data file to be modified to include the supplementary metadata required to adequately describe the content of the data file. For example, the data may be published by a third party or the user may be constrained in their workflow by choice of tools that do not support or even recognize the supplementary metadata.

It shall be possible to provide annotations about a given tabular data file without requiring that file to be modified in any way; "zero-edit" addition.

Dependency: R-LinkFromMetadataToData.

Motivation: PublicationOfNationalStatistics, SurfaceTemperatureDatabank, MakingSenseOfOtherPeoplesData and PublicationOfBiodiversityInformation.


Please refer to R-CanonicalMappingInLieuOfAnnotation for details of the requirement to transform tabular data lacking any supplementary metadata.


[tabular-metadata] specifies the format and structure of a metadata file that may be used to provide supplementary annotations on an annotated table or group of tables. Through use of such a metadata file, one may provide supplementary annotations without needing to edit the source tabular data file. Applications may use alternative mechanisms to gather annotations on an annotated table or group of tables.

Ability for a metadata description to explicitly cite the tabular dataset it describes

Metadata resources may be published independently from the tabular dataset(s) they describe; e.g. a third party may publish metadata in their own domain that describes how they have interpreted the data for their application or community. In such a case, the relationship between the metadata and data resources cannot be inferred - it must be stated explicitly.

Such a link between metadata and data resources should be discoverable, thus enabling a data publisher to determine who is referring to their data, leading to the data publisher gaining a better understanding of their user community.

Motivation: MakingSenseOfOtherPeoplesData and PublicationOfBiodiversityInformation.


In addition to providing mechanisms to locate metadata relating to a tabular data file (see [tabular-data-model] section 5. Locating Metadata), the url annotation is used to define the URL of the source data for an annotated table; for example, referring to a specific CSV file.

3.1.3 Data model requirements

Ability to determine the primary key for rows within a tabular data file

It shall be possible to uniquely identify every row within a tabular data file. The default behaviour for uniquely identifying rows is to use the row number. However, some datasets already include a unique identifier for each row in the dataset. In such cases, it shall be possible to declare which column provides the primary key.

Motivation: DigitalPreservationOfGovernmentRecords, OrganogramData, ChemicalImaging, PaloAltoTreeData and ExpressingHierarchyWithinOccupationalListings.


The primary key annotation, as specified in [tabular-data-model], may be used to define a primary key. Primary keys may be compiled from multiple values in a given row.
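
A primary key compiled from several columns can be checked for uniqueness with a few lines of Python. The sketch below is illustrative only; the function duplicate_keys, the column names and the sample rows are all hypothetical.

```python
from collections import Counter

def duplicate_keys(rows, key_columns):
    """Return the primary-key tuples that occur in more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Hypothetical rows keyed on the combination of country and year.
rows = [
    {"country": "AD", "year": "2011", "value": "1"},
    {"country": "AD", "year": "2012", "value": "2"},
    {"country": "AD", "year": "2012", "value": "3"},
]
clashes = duplicate_keys(rows, ["country", "year"])
```

An empty result means every row is uniquely identified by the declared columns.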

Ability to cross reference between CSV files

To interpret data in a given row of a CSV file, one may need to be able to refer to information provided in supplementary CSV files or elsewhere within the same CSV file; e.g. using a foreign key type reference. The cross-referenced CSV files may, or may not, be packaged together.

Motivation: DigitalPreservationOfGovernmentRecords, OrganogramData, SurfaceTemperatureDatabank, RepresentingEntitiesAndFactsExtractedFromText and SupportingSemantic-basedRecommendations.


The foreign keys annotation, as specified in [tabular-data-model], may be used to provide a list of foreign keys for an annotated table. To successfully validate, any cell value in a column referenced by the foreign key statement must have a unique value in the column of the referenced annotated table.
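
One reading of this validation rule, sketched in Python under the assumption of a single-column foreign key, is that every referencing value must match exactly one row of the referenced table. The function unresolved_references and the sample tables are hypothetical, and the normative behaviour is the one given in [tabular-data-model].

```python
def unresolved_references(rows, column, referenced_rows, referenced_column):
    """Return values in `column` that do not match exactly one referenced row."""
    targets = [r[referenced_column] for r in referenced_rows]
    return [row[column] for row in rows if targets.count(row[column]) != 1]

# Hypothetical referenced table (countries) and referencing table (observations).
countries = [{"code": "AD"}, {"code": "AF"}]
observations = [{"country": "AD"}, {"country": "XX"}]
problems = unresolved_references(observations, "country", countries, "code")
```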

As an alternative to the strong validation provided by foreign keys, references or links between rows may be asserted. The target must be identified by URI as is defined using the value URL annotation, as specified in [tabular-data-model]. Where the target is defined in another annotated table, the identity of the subject (or subjects) which the row in that table describes is defined using the about URL annotation for the cells in the target row.

Ability to add annotation and supplementary information to CSV file

Annotations and supplementary information may be associated with:

  • a group of tables
  • an entire table
  • a row
  • a column
  • an individual cell
  • a range (or region) of cells within a table

Annotations and supplementary information may be literal values or references to a remote resource. The presence of annotations or supplementary information must not adversely impact parsing of the tabular data (e.g. the annotations and supplementary information must be logically separate).


This requirement refers to provision of human-readable annotation providing additional context to a group of tables, table, column, row, cell or other region within a table. For example, the publication of national statistics use case adds the following annotations to a table:

  • title: Economic activity
  • dimensions: Economic activity (T016A), 2011 Administrative Hierarchy, 2011 Westminster Parliamentary Constituency Hierarchy
  • dataset population: All usual residents aged 16 to 74
  • coverage: England and Wales
  • area types (list omitted here for brevity)
  • textual description of dataset
  • publication information
  • contact details

This is disjoint from the requirements regarding the provision of supplementary metadata to describe the content and structure of a tabular data file in a machine readable form.

Motivation: PublicationOfNationalStatistics, SurfaceTemperatureDatabank, PublicationOfPropertyTransactionData, AnalyzingScientificSpreadsheets, ReliabilityAnalyzesOfPoliceOpenData, OpenSpendingData, RepresentingEntitiesAndFactsExtractedFromText, IntelligentlyPreviewingCSVFiles, CanonicalMappingOfCSV, SupportingSemantic-basedRecommendations, MakingSenseOfOtherPeoplesData, PublicationOfBiodiversityInformation, ExpressingHierarchyWithinOccupationalListings and PlatformIntegrationUsingSTDF.


Any annotation may be used in addition to the core annotations specified in [tabular-data-model], such as title, author, license etc. [tabular-metadata] section 5.8 Common Properties describes how such 'non-core' annotations are provided in a supplementary metadata file.

Any number of additional annotations may be provided for a group of tables or an annotated table; see table-group-notes and table-notes respectively.


The Web Annotation Working Group is developing a vocabulary for expressing annotations. An example use of the table-notes annotation and the Web Annotation Working Group's open annotation vocabulary is provided in [csv2rdf].

Ability to associate a code value with externally managed definition

CSV files make frequent use of code values when describing data. Examples include: geographic regions, status codes and category codes. In some cases, names are used as a unique identifier for a resource (e.g. company name within a transaction audit). It is difficult to interpret the tabular data without an unambiguous definition of the code values or (local) identifiers used.

It must be possible to unambiguously associate the notation used within a CSV file with the appropriate external definition.

Dependency: URIMapping.

Motivation: PublicationOfNationalStatistics, PublicationOfPropertyTransactionData, SurfaceTemperatureDatabank, OpenSpendingData, RepresentingEntitiesAndFactsExtractedFromText, IntelligentlyPreviewingCSVFiles, SupportingSemantic-basedRecommendations, MakingSenseOfOtherPeoplesData, PublicationOfBiodiversityInformation and CollatingHumanitarianResponseInformation.


Code values expressed within a cell can be associated with external definitions in two ways:

  1. The valueURL property, as defined in [tabular-metadata], may be used to provide a URI Template that converts the code value to a URI, thus explicitly identifying the associated external definition. URI Templates are defined in [RFC6570].
  2. The foreignKeys property, as defined in [tabular-metadata], may be used to provide a foreign key definition that relates the values in a column of the annotated table to those in a column of another annotated table. The definition of the code value could be provided in the table referenced via the foreign key.

Ability to declare syntactic type for cells within a specified column.

Whilst it is possible to automatically detect the type of data (e.g. date, number) in a given cell, this can be error prone. For example, the date April 1st if written as 1/4 may be interpreted as a decimal fraction.

It shall be possible to declare the data type for the cells in a given column of a tabular data file. Only one data type can be declared for a given column.


An application may still attempt to automatically detect the data type for a given cell. However, the explicit declaration shall always take precedence.


The data type declaration will typically be used to declare that a column contains integers, floating point numbers or text. However, it may be used to assert that a cell contains, say, embedded XML content (rdf:XMLLiteral), datetime values (xsd:dateTime) or geometry expressed as well-known-text (geo:wktLiteral; see GeoSPARQL [geosparql] section 8.5.1 RDFS Datatypes).

Motivation: SurfaceTemperatureDatabank, DigitalPreservationOfGovernmentRecords, ReliabilityAnalyzesOfPoliceOpenData, AnalyzingScientificSpreadsheets, RepresentingEntitiesAndFactsExtractedFromText, DisplayingLocationsOfCareHomesOnAMap, IntelligentlyPreviewingCSVFiles, CanonicalMappingOfCSV, SupportingSemantic-basedRecommendations, PublicationOfBiodiversityInformation, PlatformIntegrationUsingSTDF and ConsistentPublicationOfLocalAuthorityData.


The syntactic type for a cell value is defined using the datatype annotation. [tabular-data-model] section 4.6 Datatypes lists the built-in datatypes used in this specification; including those defined in [xmlschema11-2] plus number, binary, datetime, any, html, and json. Datatypes can be derived from the built-in datatypes using further annotations; [tabular-metadata] section 5.11.2 Derived Datatypes specifies how to describe derived datatypes within a metadata file.
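
To show why an explicit declaration helps, the Python sketch below parses cell text according to a declared datatype instead of guessing. It is a simplified illustration: the function parse_cell is hypothetical, only a handful of datatypes are handled, and the date format uses Python strptime directives, not the format patterns defined in [tabular-data-model].

```python
from datetime import date, datetime

def parse_cell(text, datatype, date_format="%d/%m/%Y"):
    """Convert cell text using the datatype declared for its column."""
    if datatype == "date":
        return datetime.strptime(text, date_format).date()
    if datatype == "integer":
        return int(text)
    if datatype == "number":
        return float(text)
    return text  # anything else is kept as a plain string

# With a declared date type, "1/4/2014" is read as 1 April 2014, not as a fraction.
april_first = parse_cell("1/4/2014", "date")
```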

Ability to declare semantic type for cells within a specified column.

Each row in a tabular data set describes a given resource or entity. The properties for that entity are described in the cells of that row. All the cells in a given column are anticipated to provide the same property.

It shall be possible to declare the semantic relationship between the entity that a given row describes and a cell in a given column.

The following example of an occupational listing illustrates how a row of tabular data can be mapped to equivalent content expressed in RDF (Turtle).

The mappings are:

Example 52

O*NET-SOC 2010 Code,O*NET-SOC 2010 Title,O*NET-SOC 2010 Description
         11-1011.00,    Chief Executives,"Determine and formulate policies and provide overall direction of companies [...]."

RDF (Turtle)

    skos:notation "11-1011.00" ;
    rdfs:label "Chief Executives" ;
    dc:description "Determine and formulate policies and provide overall direction of companies [...]." .

A copy of the occupational listing CSV is available locally.


To express semantics in a machine readable form, RDF seems the appropriate choice. Furthermore, best practice indicates that one should adopt common and widely adopted patterns (e.g. RDF vocabularies, OWL ontologies) when publishing data to enable a wide audience to consume and understand the data. Existing (de facto) standard patterns may add complexity when defining the semantics associated with a particular row such that a single RDF predicate is insufficient.

For example, to express a quantity value using QUDT we use an instance of qudt:QuantityValue to relate the numerical value with the quantity kind (e.g. air temperature) and unit of measurement (e.g. Celsius). Thus the semantics needed for a column containing temperature values might be: qudt:value/qudt:numericValue - more akin to an LDPath.

Furthermore, use of OWL axioms when defining a sub-property of qudt:value would allow the quantity type and unit of measurement to be inferred, with the column semantics then being specified as ex:temperature_Cel/qudt:numericValue.

Motivation: DigitalPreservationOfGovernmentRecords, PublicationOfNationalStatistics, SurfaceTemperatureDatabank, ReliabilityAnalyzesOfPoliceOpenData, AnalyzingScientificSpreadsheets, RepresentingEntitiesAndFactsExtractedFromText, IntelligentlyPreviewingCSVFiles, SupportingSemantic-basedRecommendations, MakingSenseOfOtherPeoplesData, PublicationOfBiodiversityInformation and CollatingHumanitarianResponseInformation.


The property URL annotation provides the URI for the property relating the value of a given cell to its subject. [tabular-metadata] specifies how a URI Template, specified in [RFC6570], may be used to specify the property URL using the propertyURL property. This property is normally specified for the column and inherited by all the cells within that column.

Ability to declare a "missing value" token and, optionally, a reason for the value to be missing

Significant amounts of existing tabular text data include values such as -999. Typically, these are outside the normal expected range of values and are meant to imply that the value for that cell is missing. Automated parsing of CSV files needs to recognise such missing value tokens and behave accordingly. Furthermore, it is often useful for a data publisher to declare why a value is missing; e.g. withheld or aboveMeasurementRange.

Motivation: SurfaceTemperatureDatabank, OrganogramData, OpenSpendingData, NetCdFcDl, PaloAltoTreeData and PlatformIntegrationUsingSTDF.


[tabular-data-model] defines the null annotation, which specifies the string or strings that, when matched to the literal content of a cell, cause the cell's value to be interpreted as null (or empty).
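
The matching step can be pictured with a minimal Python sketch. The function apply_null and the sample readings are hypothetical; the sketch only compares the cell text against the declared tokens and does not record a reason for the value being missing.

```python
def apply_null(cell_text, null_tokens=("",)):
    """Return None when the cell text equals one of the declared null tokens."""
    return None if cell_text in null_tokens else cell_text

# Hypothetical temperature readings where -999 marks a missing value.
readings = ["12.5", "-999", "13.1"]
cleaned = [apply_null(value, null_tokens=("-999",)) for value in readings]
```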

Ability to map cell values within a given column into corresponding URI

Tabular data often makes use of local identifiers to uniquely identify an entity described within a tabular data file or to reference an entity described in the same data file or elsewhere (e.g. reference data, code lists, etc.). The local identifier will often be unique within a particular scope (e.g. a code list or data set), but cannot be guaranteed to be globally unique. In order to make these local identifiers globally unique (e.g. so that the entity described by a row in a tabular data file can be referred to from an external source, or to establish links between the tabular data and the related reference data) it is necessary to map those local identifiers to URIs.

It shall be possible to declare how local identifiers used within a column of a particular dataset can be mapped to their respective URI. Typically, this may be achieved by concatenating the local identifier with a prefix - although more complex mappings are anticipated such as removal of "special characters" that are not permitted in URIs (as defined in [RFC3986]) or CURIEs ([curie]).

Furthermore, where the local identifier is part of a controlled vocabulary, code list or thesaurus, it should be possible to specify the URI for the controlled vocabulary within which the local identifier is defined.


Also see the related requirement R-ForeignKeyReferences.

Motivation: DigitalPreservationOfGovernmentRecords, OrganogramData, PublicationOfPropertyTransactionData, AnalyzingScientificSpreadsheets, RepresentingEntitiesAndFactsExtractedFromText, PaloAltoTreeData, PublicationOfBiodiversityInformation, MakingSenseOfOtherPeoplesData and ExpressingHierarchyWithinOccupationalListings.


The valueURL property from [tabular-metadata] specifies how a URI Template, as defined in [RFC6570], may be used to map literal contents of a cell to a URI. The result of evaluating the URI Template is stored in the value URL annotation for each cell.
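
For the simplest kind of template, where a local identifier is substituted into a fixed prefix, the expansion can be sketched in Python as below. This covers only simple {name} expressions (Level 1 of [RFC6570]) and is not a full URI Template processor; the function expand_simple_template, the example.org URLs and the variable names are hypothetical.

```python
import re
from urllib.parse import quote

def expand_simple_template(template, cell_values):
    """Expand {name} expressions from a row's cell values, percent-encoding each value."""
    def substitute(match):
        return quote(str(cell_values[match.group(1)]), safe="")
    return re.sub(r"\{(\w+)\}", substitute, template)

# Map a local occupation code to a URI under a hypothetical prefix.
url = expand_simple_template(
    "http://example.org/occupation/{code}", {"code": "11-1011.00"}
)
```

Percent-encoding the substituted value is what handles the "special characters" that are not permitted in URIs.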

Ability to identify/express the unit of measure for the values reported in a given column.

Data from measurements is often published and exchanged as tabular data. In order for the values of those measurements to be correctly understood, it is essential that the unit of measurement associated with the values can be specified. For example, without specifying the unit of measurement as kilometers, the floating point value 21.5 in a column entitled distance is largely meaningless.

Motivation: AnalyzingScientificSpreadsheets, OpenSpendingData, IntelligentlyPreviewingCSVFiles, ChemicalImaging, ChemicalStructures, NetCdFcDl and PaloAltoTreeData.


This specification provides no native mechanisms for expressing the unit of measurement associated with values of cells in a column.

However, annotations may be used to provide this additional information. The [tabular-data-primer] provides examples of how this might be achieved; from providing descriptive metadata for the column, to enabling transformation of cell values to structured data with unit of measurement properties.

Also note that the [vocab-data-cube] provides another alternative for annotations; structural metadata is used to provide the metadata required to interpret data values - such as the unit of measurement.

Ability to group multiple data tables into a single package for publication

When publishing sets of related data tables, it shall be possible to provide annotation for the group of related tables. Annotation concerning a group of tables may include summary information about the composite dataset (or "group") that the individual tabular datasets belong to, such as the license under which the dataset is made available.

The implication is that the group shall be identified as an entity in its own right, thus enabling assertions to be made about that group. The relationship between the group and the associated tabular datasets will need to be made explicit.

Furthermore, where appropriate, it shall be possible to describe the interrelationships between the tabular datasets within the group.

The tabular datasets comprising a group need not be hosted at the same URL. As such, a group does not necessarily need to be published as a single package (e.g. as a zip) - although we note that this is a common method of publication.

Motivation: PublicationOfNationalStatistics, OrganogramData, ChemicalStructures and NetCdFcDl.


The group of tables, as defined in [tabular-data-model], is a first class entity within the tabular data model. A group of tables comprises a set of annotated tables and a set of annotations that relate to that group of tables.

Ability to declare a locale / language for content in a specified column

Tabular data may contain literal values for a given property in multiple languages. For example, the name of a town in English, French and Arabic. It shall be possible to:

  • specify the property for which the literal values are supplied; and
  • specify the language / locale relevant to all data values in a given column.

Additionally, it should be possible to provide supplementary labels for column headings in multiple languages.

Motivation: CollatingHumanitarianResponseInformation.


The lang annotation, as defined in [tabular-data-model], may be used to express the code for the expected language for values of cells in a particular column. The language code is expressed in the format defined by [BCP47].

Furthermore, the titles annotation allows for any number of human-readable titles to be given for a column, each of which may have an associated language code as defined by [BCP47].

Ability to provide multiple values of a given property for a single entity described within a tabular data file

It is commonplace for a tabular data file to provide multiple values of a given property for a single entity. This may be achieved in a number of ways.

First, multiple rows may be used to describe the same entity; each such row using the same unique identifier for the entity. For example, a country, identified using its two-letter country code, may have more than one name:

Example 53

AD,     Andorra
AD,     Principality of Andorra
AF,     Afghanistan
AF,     Islamic Republic of Afghanistan

Equivalent JSON:

[{
  "country": "AD",
  "name": [ "Andorra", "Principality of Andorra" ]
},{
  "country": "AF",
  "name": [ "Afghanistan", "Islamic Republic of Afghanistan" ]
}]

Second, a single row within a tabular data set may contain multiple values for a given property by declaring that multiple columns map to the same property. For example, multiple locations:

Example 54

geocode #1,geocode #2,geocode #3
    020503,          ,
    060107,    060108,
    173219,          ,
    530012,    530013,    530015
    279333,          ,

Equivalent RDF (in Turtle syntax):

row:1 admingeo:gssCode ex:020503 .
row:2 admingeo:gssCode ex:060107, ex:060108 .
row:3 admingeo:gssCode ex:173219 .
row:4 admingeo:gssCode ex:530012, ex:530013, ex:530015 .
row:5 admingeo:gssCode ex:279333 .

In this case, it is essential to declare that each of the columns refers to the same property. In the example above, all the geocode columns map to admingeo:gssCode.

Finally, microsyntax may provide a list of values within a single cell. For example, a semi-colon ";" delimited list of comments about the characteristics of a tree within a municipal database:

Example 55

GID,Tree ID, On Street,From Street,To Street,             Species,[...],Comments
  6,     34,ADDISON AV, EMERSON ST,RAMONA ST,Robinia pseudoacacia,[...],cavity or decay; trunk decay; codominant leaders; included bark; large leader or limb decay; previous failure root damage; root decay;  beware of BEES.

Equivalent JSON:

{
  "GID": "6",
  "Tree_ID": "34",
  "On_Street": "ADDISON AV",
  "From_Street": "EMERSON ST",
  "To_Street": "RAMONA ST",
  "Species": "Robinia pseudoacacia",
  "Comments": [ "cavity or decay", "trunk decay", "codominant leaders", "included bark", "large leader or limb decay", "previous failure root damage", "root decay", "beware of BEES."]
}

Note that the example above is based on the Palo Alto tree data use case; albeit truncated for clarity.


In writing this requirement, no assumption has been made regarding how the repeated values should be implemented in RDF, JSON or XML.

Motivation: JournalArticleSearch, PaloAltoTreeData, SupportingSemantic-basedRecommendations and CollatingHumanitarianResponseInformation.


Within an annotated table, the values of cells can be considered as RDF subject-predicate-object triples (see [rdf11-concepts]). The about URL annotation may be used to define the subject of the triple derived from a cell, and, where the same about URL annotation is used for every cell within a row, the resource identified by the about URL annotation can be considered to be the subject of the row.

The same about URL annotation can be used to describe cells in more than one row, thus enabling information about a single subject to be spread across multiple rows.

Similarly, the property URL annotation may be used to define the predicate of the triple derived from a cell. The same property URL annotation may be used for multiple columns, meaning that multiple values of a single property can be provided across multiple columns.

Finally, note that arrays of values may be provided by a single cell. Please refer to requirement R-CellMicrosyntax for further details.
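
The first pattern, where several rows share a subject, can be sketched in Python as a simple grouping step. This is an illustration only, not the transformation defined in [csv2json]: the function group_names is hypothetical and the country code stands in for the shared subject identifier.

```python
import csv
import io

def group_names(csv_text):
    """Collect the names from every row that shares the same country code."""
    grouped = {}
    # skipinitialspace drops the padding that follows each comma in the sample.
    for code, name in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        grouped.setdefault(code, []).append(name)
    return [{"country": code, "name": names} for code, names in grouped.items()]

countries = (
    "AD,     Andorra\n"
    "AD,     Principality of Andorra\n"
    "AF,     Afghanistan\n"
    "AF,     Islamic Republic of Afghanistan\n"
)
merged = group_names(countries)
```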

3.2 Partially accepted requirements

3.2.1 Data model requirements

Ability to parse internal data structure within a cell value

Cell values may represent more complex data structures for a given column such as lists and time stamps. The presence of complex data structures within a given cell is referred to as microsyntax.

If present, parsers should have the option of handling the microsyntax or ignoring it and treating it as a scalar value.

Looking in further detail at the uses of microsyntax, four types of usage are prevalent:

  1. various date/time syntaxes (not just ISO-8601 ones)
  2. delimited lists of literal values to express multiple values of the same property (typically comma "," delimited, but other delimiters are also used)
  3. embedded structured data such as XML, JSON or well-known-text (WKT) literals
  4. semi structured text

The following requirements pertain to describing and parsing microsyntax:

  • to document microsyntax so that humans can understand what it is conveying; e.g. to provide human-readable annotation
  • to validate the cell values to ensure they conform to the expected microsyntax
  • to label the value as being in a particular microsyntax when converting into JSON/XML/RDF; e.g. marking an XML value as an XMLLiteral or a datetime value as xsd:dateTime
  • to process the microsyntax into an appropriate data structure when converting into JSON/XML/RDF

The ability to declare that a column within a tabular data file carries values of a particular type, and the potential validation of the cell against the declared type, is covered in R-SyntacticTypeDefinition and is not discussed further here.

We can consider cell values with microsyntax to be annotated strings. The annotation (which might include a definition of the format of the string - such as defining the delimiter used for a list) can be used to validate the string and (in some cases) convert it into a suitable value or data structure.

Microsyntax, therefore, requires manipulation of the text if processed. Typically, this will relate to conversion of lists into multiple-valued entries, but may also include reformatting of text to convert between formats (e.g. to convert a datetime value to a date, or locale dates to ISO 8601 compliant syntax).

Motivation: JournalArticleSearch, PaloAltoTreeData, SupportingSemantic-basedRecommendations, ExpressingHierarchyWithinOccupationalListings and PlatformIntegrationUsingSTDF.


This specification indicates how applications should provide support for validating the format, or syntax, of the literal content provided in cells. [tabular-data-model] section 6.4 Parsing Cells describes validation of formats for numeric datatypes, boolean, dates, times, and durations.

Please refer to R-SyntacticTypeDefinition for details of the associated requirement.

A regular expression, with syntax and processing as defined in [ECMASCRIPT], may be used to validate the format of a string value. In this way, the syntax of embedded structured data (e.g. html, json, xml and well known text literals) can be validated.

However, support for the extraction of values from structured data is limited to parsing the cell content to extract an array of values. Parsers must use the value of the separator annotation, as specified in [tabular-data-model], to split the literal content of the cell. All values within the array are considered to be of the same datatype.
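
The splitting step can be sketched in Python as follows. The function split_cell is hypothetical, and trimming the whitespace around each value is an assumption made for this illustration; [tabular-data-model] defines the exact rules.

```python
def split_cell(text, separator):
    """Split cell text on the separator, trimming whitespace around each value."""
    return [part.strip() for part in text.split(separator)]

# A shortened version of the Palo Alto tree comments, delimited by ";".
comments = split_cell("cavity or decay; trunk decay;  beware of BEES.", ";")
```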

This functibet365ality meets the needs of 4 out of 5 motivating requirements:

  • JournalArticleSearch: date-time formats dealt with as a native datatype and the list of authors is treated as an array. The journal title does cbet365tain html markup (e.g. the <i> html element) but the use case indicates that it is acceptable to treat this as literal text.
  • PaloAltoTreeData: list of comments delimited with semi-colbet365 (";") are mapped to an array of values.
  • SupportingSemantic-basedRecommendatibet365s: the 'semantic paths' are a comma delimited lit of URIs which are mapped to an array of values. The use case does not indicate that different semantics need to be applied to each value in the array.
  • PlatformIntegratibet365UsingSTDF: escape sequences for 'special characters' are not supported, but the use case indicates that "these special characters dbet365't affect the parsing" so are cbet365sidered not to be a microsyntax from which separate data values are to be extracted.

This specificatibet365 does not natively meet the requirement to extract values from other structured data formats; the Working Group deemed this to add significant complexity to both specificatibet365 and cbet365forming applicatibet365s.

That said, an annotated table may specify transformatibet365s which define a list of specificatibet365s for cbet365verting the associated annotated table into other formats using a script or template such as Mustache. These scripts or templates may be used to extract values from structured data, operating bet365 the annotated table itself, the RDF graph provided from transforming the annotated table into RDF using standard mode (as specified in [csv2rdf]), or the JSON provided when using the standard mode specified in [csv2jsbet365]. Transformatibet365 specificatibet365s are defined in [tabular-metadata] sectibet365 5.10 Transformatibet365 Definitibet365s.

Use case ExpressingHierarchyWithinOccupationalListings requires the extraction of values from substrings within cell values (e.g. different parts of the structured occupation code). Such processing may be achievable using scripts or templates which can be specified using a transformation definition.
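The kind of script that such a transformation definition could point to is sketched below in Python. The code layout (`NN-NNNN.NN`), the sample value and the part names `major`, `detail` and `extension` are assumptions chosen for illustration; they are not taken from the use case.

```python
import re

# Assumed layout: two digits, a hyphen, four digits, a dot, two digits.
CODE_PATTERN = re.compile(
    r"^(?P<major>\d{2})-(?P<detail>\d{4})\.(?P<extension>\d{2})$"
)


def split_code(code):
    """Return the named parts of a structured code, or raise ValueError."""
    match = CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"unrecognised code: {code!r}")
    return match.groupdict()


parts = split_code("15-1199.08")  # hypothetical cell value
print(parts)
```

A script of this kind runs outside the core processing model, which is why the requirement is only indirectly addressed.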

Ability to assert how a single CSV file is a facet or subset of a larger dataset

A large tabular dataset may be split into several files for publication; perhaps to ensure that each file is a manageable size or to publish the updates to a dataset during the (re-)publishing cycle. It shall be possible to declare that each of the files is part of the larger dataset and to describe what content can be found within each file in order to allow users to rapidly find the particular file containing the information they are interested in.

Motivation: SurfaceTemperatureDatabank, PublicationOfPropertyTransactionData, JournalArticleSearch, ChemicalImaging and NetCdFcDl.


This specification provides only a simple grouping mechanism to relate annotated tables, as described in [tabular-data-model] section 4.1 Table groups. Large tabular datasets may be subdivided into smaller parts for easier management. Each of the smaller parts may be related to each other using a group of tables.
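A group of tables can be pictured as a metadata description with a `tables` array, one entry per file. The Python sketch below builds such a description; the helper name `make_table_group` and the file names are hypothetical, and the structure shown is a minimal reading of [tabular-metadata] rather than a complete description.

```python
import json


def make_table_group(urls):
    """Build a minimal table group description listing each file as a table."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "tables": [{"url": url} for url in urls],
    }


# Hypothetical parts of one larger dataset, split by year.
table_group = make_table_group(
    ["observations-2013.csv", "observations-2014.csv"]
)

print(json.dumps(table_group, indent=2))
```

Note that the description only states that the files belong together; it says nothing about which subset of the dataset each file holds.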

However, no mechanism is provided for describing the relationship between tables other than simple grouping. Other specifications, such as [vocab-data-cube] and [void], provide mechanisms to describe subsets of data that can be used to meet this requirement. Such descriptions can be included as metadata annotations in the form of notes.

3.3 Deferred requirements

3.3.1 CSV parsing requirements

Ability to determine that a CSV is syntactically well formed

In order to automate the parsing of information published in CSV form, it is essential that the content be well-formed with respect to the syntax for tabular data [tabular-data-model].

Motivation: DigitalPreservationOfGovernmentRecords, OrganogramData, ChemicalImaging, ChemicalStructures, NetCdFcDl, PaloAltoTreeData, CanonicalMappingOfCSV, IntelligentlyPreviewingCSVFiles, MakingSenseOfOtherPeoplesData and ConsistentPublicationOfLocalAuthorityData.


This requirement has been deferred as a normative specification for parsing CSV is outside the scope of the Working Group charter. [tabular-data-model] does provide a non-normative definition for parsing CSV files, including the flexibility to parse tabular data that does not use commas as separators.
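To illustrate parsing with a separator other than a comma, the following sketch uses Python's standard `csv` module. It is not the non-normative algorithm from [tabular-data-model]; the function name `parse_delimited` and the sample content are assumptions for illustration.

```python
import csv
import io


def parse_delimited(text, delimiter=","):
    """Parse delimited text into a list of rows using Python's csv module."""
    # newline="" lets the csv module handle line endings itself.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


# Hypothetical semi-colon separated content.
rows = parse_delimited("name;height\nOak;12\nElm;9\n", delimiter=";")
print(rows)
```

The parser here will accept rows of differing lengths without complaint, so a check that the result is well formed (for example, that every row has the same number of cells) would still need to be added by the application.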

Ability to handle headings spread across multiple initial rows, as well as to distinguish between single column headings and file headings.

Row headings should be distinguished from file headings (if present). Also, in case subheadings are present, it should be possible to define their coverage (i.e. how many columns they refer to).

Motivation: PublicationOfNationalStatistics, AnalyzingScientificSpreadsheets, IntelligentlyPreviewingCSVFiles, CollatingHumanitarianResponseInformation, ExpressingHierarchyWithinOccupationalListings and PlatformIntegrationUsingSTDF.


The Working Group decided to rule headings spanning multiple columns out of scope. However, it is possible to skip initial rows that do not contain header information using skipRows and to specify that a table contains multiple header rows using headerRowCount when describing a dialect, as described in [tabular-metadata].
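The effect of these two dialect properties can be sketched in Python as follows. This is a simplified reading, not the processing defined in [tabular-data-model]: the function name, the sample rows and the choice to drop empty header cells are assumptions.

```python
def split_header_and_data(rows, skip_rows=0, header_row_count=1):
    """Apply skipRows and headerRowCount to already-parsed rows.

    Returns (titles, data_rows), where titles holds one list of header
    strings per column, collected across all header rows.
    """
    remaining = rows[skip_rows:]
    header_rows = remaining[:header_row_count]
    data_rows = remaining[header_row_count:]
    width = max((len(row) for row in header_rows), default=0)
    titles = [
        # Empty header cells are ignored (an assumption of this sketch).
        [row[i] for row in header_rows if i < len(row) and row[i] != ""]
        for i in range(width)
    ]
    return titles, data_rows


# Hypothetical file: one preamble row, two header rows, two data rows.
rows = [
    ["Source: example agency"],
    ["Region", "Population"],
    ["", "(thousands)"],
    ["North", "120"],
    ["South", "95"],
]
titles, data_rows = split_header_and_data(rows, skip_rows=1, header_row_count=2)
print(titles)
print(data_rows)
```

Each column simply accumulates the headings above it; nothing records that a heading spans several columns, which is the part of the requirement that was ruled out of scope.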

Ability to transform data that is published in a normalized form into tabular data.

Textual data may be published in a normalized form; often improving human readability by reducing the number of lines in the data file. As a result, such a normalized data file will no longer be regular as additional information is included in each row (e.g., the number of columns will vary because more cells are provided for some rows).


Use of the term normalized is meant in a general sense, rather than the specific meaning relevant to relational databases.

Such a normalized data file must be transformed into a tabular data file, as defined by the model for tabular data [tabular-data-model], prior to applying any further transformation.

Motivation: RepresentingEntitiesAndFactsExtractedFromText.


The motivating use case is an example where we have a CSV file that is not well-formed: in this particular case, the number of columns varies row by row and the file therefore does not conform to the model for tabular data [tabular-data-model].

The ability to transform a data file into a tabular data file is a necessary prerequisite for any subsequent transformation. That said, such a transformation is outside the scope of this Working Group as it requires parsing a data file with an arbitrary structure.

Such pre-processing to create a tabular data file from a given structure is likely to be reasonably simple for a programmer to implement, but it cannot be generalised.
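As one example of such file-specific pre-processing, the Python sketch below pads short rows so that every row has the same number of cells. Padding is only one possible strategy and is assumed here for illustration; whether it is appropriate depends entirely on what the extra cells mean in a given file. The function name `pad_rows` and the sample data are hypothetical.

```python
def pad_rows(rows, fill=""):
    """Pad every row with `fill` so that all rows match the longest row."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


# Hypothetical ragged input: the second row carries fewer cells.
ragged = [["entity", "fact", "source"], ["W3C"], ["CSV", "format"]]
regular = pad_rows(ragged)
print(regular)
```

A file in which the extra cells represent repeated values for one column would instead need those rows to be split into several rows, which is why no single routine can cover every case.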

3.3.2 Applications requirements

Ability to access and/or extract part of a CSV file in a non-sequential manner.

Large datasets may be hard to process in a sequential manner. It may be useful to have the possibility to directly access part of them, possibly by means of a pointer to a given row, cell or region.

Motivation: SupportingSemantic-basedRecommendations.


A standardised mechanism for querying tabular data is outside the scope of the Working Group. However, it is possible to use fragment identifiers as defined in [RFC7111] to identify columns, rows, cells, and regions of CSV files, and sufficient information is kept in the tabular data model to ensure that this ability is retained.
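A minimal Python sketch of resolving a row-based fragment identifier against parsed rows is given below. It handles only a single row or a closed row range (such as `row=4` or `row=2-3`) and assumes rows are counted from 1 over the lines of the file, header included; the other selection forms defined in [RFC7111], such as column and cell selections, are not handled. The function name `select_rows` and the sample data are hypothetical.

```python
import re


def select_rows(rows, fragment):
    """Return the rows addressed by a 'row=N' or 'row=N-M' fragment."""
    match = re.fullmatch(r"row=(\d+)(?:-(\d+))?", fragment)
    if match is None:
        raise ValueError(f"unsupported fragment: {fragment!r}")
    first = int(match.group(1))
    last = int(match.group(2)) if match.group(2) else first
    # Fragment positions are 1-based and inclusive; Python slices are 0-based.
    return rows[first - 1:last]


# Hypothetical parsed file, header row included as row 1.
rows = [
    ["date", "temperature"],
    ["2011-01-01", "1"],
    ["2011-01-02", "-1"],
    ["2011-01-03", "0"],
]
print(select_rows(rows, "row=4"))
print(select_rows(rows, "row=2-3"))
```

This sketch still reads the whole file before selecting; true non-sequential access would need an index or a storage format that supports seeking to a given row.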

Ability to transform a CSV into XML

Standardised CSV to XML transformation mechanisms mitigate the need for bespoke transformation software to be developed by CSV data consumers, thus simplifying the exploitation of CSV data.

Motivation: DigitalPreservationOfGovernmentRecords.


Although the charter of the Working Group includes a work item for CSV to XML conversion, this requirement has unfortunately been deferred. The Working Group was unable to find XML experts to assist in delivery of this work item. The lack of available effort, combined with the fact that only a single use case motivated this requirement, meant that the Working Group was forced to abandon this deliverable.

Ability to apply conditional processing based on the value of a specific cell

When transforming CSV content into XML, JSON or RDF it shall be possible to vary the transformation of the information in a particular row based on the values within a cell, or element within a cell, contained within that row.

To vary the transformation based on an element within a cell, the value of that cell must be well structured. See CellMicrosyntax for more information.

Motivation: ExpressingHierarchyWithinOccupationalListings.


The ability to control the processing of tabular data based on values in a particular cell is not natively supported by this specification. Following detailed analysis, the Working Group concluded that such functionality would add significant complexity to the specification and implementing applications. However, an annotated table may specify transformations which define a list of specifications for converting the associated annotated table into other formats using a script or template such as Mustache. These scripts or templates may be used to provide conditional processing, operating on the annotated table itself, the RDF graph provided from transforming the annotated table into RDF using standard mode (as specified in [csv2rdf]), or the JSON provided when using the standard mode specified in [csv2json]. Transformation specifications are defined in [tabular-metadata] section 5.10 Transformation Definitions.

A. Acknowledgements

At the time of publication, the following individuals had participated in the Working Group, in the order of their first name: Adam Retter, Alf Eaton, Anastasia Dimou, Andy Seaborne, Axel Polleres, Christopher Gutteridge, Dan Brickley, Davide Ceolin, Eric Stephan, Erik Mannens, Gregg Kellogg, Ivan Herman, Jeni Tennison, Jeremy Tandy, Jürgen Umbrich, Rufus Pollock, Stasinos Konstantopoulos, William Ingram, and Yakov Shafranovich.

B. Changes since previous versions

B.1 Changes since working draft of 01 July 2014

B.2 Changes since first public working draft of 27 March 2014

C. References

C.1 Normative references

A. Phillips; M. Davis. Tags for Identifying Languages. September 2009. IETF Best Current Practice. URL: https://tools.ietf.org/html/bcp47
Jeremy Tandy; Ivan Herman. Generating JSON from Tabular Data on the Web. 17 December 2015. W3C Recommendation. URL: http://www.w3.org/TR/csv2json/
Jeremy Tandy; Ivan Herman; Gregg Kellogg. Generating RDF from Tabular Data on the Web. 17 December 2015. W3C Recommendation. URL: http://www.w3.org/TR/csv2rdf/
Jeni Tennison; Gregg Kellogg. Model for Tabular Data and Metadata on the Web. 17 December 2015. W3C Recommendation. URL: http://www.w3.org/TR/tabular-data-model/
Jeni Tennison; Gregg Kellogg. Metadata Vocabulary for Tabular Data. 17 December 2015. W3C Recommendation. URL: http://www.w3.org/TR/tabular-metadata/

C.2 Informative references

ECMAScript Language Specification. URL: https://tc39.github.io/ecma262/
T. Berners-Lee; R. Fielding; L. Masinter. Uniform Resource Identifier (URI): Generic Syntax. January 2005. Internet Standard. URL: https://tools.ietf.org/html/rfc3986
Y. Shafranovich. Common Format and MIME Type for Comma-Separated Values (CSV) Files. October 2005. Informational. URL: https://tools.ietf.org/html/rfc4180
J. Gregorio; R. Fielding; M. Hadley; M. Nottingham; D. Orchard. URI Template. March 2012. Proposed Standard. URL: https://tools.ietf.org/html/rfc6570
M. Hausenblas; E. Wilde; J. Tennison. URI Fragment Identifiers for the text/csv Media Type. January 2014. Informational. URL: https://tools.ietf.org/html/rfc7111
T. Bray, Ed. The JavaScript Object Notation (JSON) Data Interchange Format. March 2014. Proposed Standard. URL: https://tools.ietf.org/html/rfc7159
Mark Birbeck; Shane McCarron. CURIE Syntax 1.0. 16 December 2010. W3C Note. URL: http://www.w3.org/TR/curie
OGC GeoSPARQL - A Geographic Query Language for RDF Data. OpenGIS Implementation Specification. URL: https://portal.opengeospatial.org/files/?artifact_id=47664
Manu Sporny; Gregg Kellogg; Markus Lanthaler. JSON-LD 1.0. 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/json-ld/
Richard Cyganiak; David Wood; Markus Lanthaler. RDF 1.1 Concepts and Abstract Syntax. 25 February 2014. W3C Recommendation. URL: http://www.w3.org/TR/rdf11-concepts/
Jeni Tennison. CSV on the Web: A Primer. W3C Note. URL: http://www.w3.org/TR/2016/NOTE-tabular-data-primer-20160225/
Eric Prud'hommeaux; Gavin Carothers. RDF 1.1 Turtle. 25 February 2014. W3C Recommendation. URL: http://www.w3.org/TR/turtle/
Richard Cyganiak; Dave Reynolds. The RDF Data Cube Vocabulary. 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/vocab-data-cube/
Keith Alexander; Richard Cyganiak; Michael Hausenblas; Jun Zhao. Describing Linked Datasets with the VoID Vocabulary. 3 March 2011. W3C Note. URL: http://www.w3.org/TR/void/
Tim Bray; Jean Paoli; Michael Sperberg-McQueen; Eve Maler; François Yergeau et al. Extensible Markup Language (XML) 1.0 (Fifth Edition). 26 November 2008. W3C Recommendation. URL: http://www.w3.org/TR/xml
David Peterson; Sandy Gao; Ashok Malhotra; Michael Sperberg-McQueen; Henry Thompson; Paul V. Biron et al. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. 5 April 2012. W3C Recommendation. URL: http://www.w3.org/TR/xmlschema11-2/