diff --git a/csv.md b/csv.md
index 4c59392f..062c03ab 100644
--- a/csv.md
+++ b/csv.md
@@ -1,94 +1,62 @@
 ---
-language: CSV
+name: CSV
 contributors:
-- [Timon Erhart, 'https://github.com/turbotimon/']
+ - [Timon Erhart, 'https://github.com/turbotimon/']
 ---
 
-CSV (Comma-Separated Values) is a lightweight file format used to store tabular
-data in plain text, designed for easy data exchange between programs,
-particularly spreadsheets and databases. Its simplicity and human readability
-have made it a cornerstone of data interoperability. It is often used for
-moving data between programs with incompatible or proprietary formats.
-
-While RFC 4180 provides a standard for the format, in practice, the term "CSV"
- is often used more broadly to refer to any text file that:
-
-- Can be interpreted as tabular data
-- Uses a delimiter to separate fields (columns)
-- Uses line breaks to separate records (rows)
-- Optionally includes a header in the first row
+CSV (Comma-Separated Values) is a file format used to store tabular
+data in plain text.
 
 ```csv
-Name, Age, DateOfBirth
-Alice, 30, 1993-05-14
-Bob, 25, 1998-11-02
-Charlie, 35, 1988-03-21
+Name,Age,DateOfBirth,Comment
+Alice,30,1993-05-14,
+Bob,25,1998-11-02,
+Eve,,,data might be missing because it's just text
+"Charlie Brown",35,1988-03-21,strings can be quoted
+"Louis XIV, King of France",76,1638-09-05,strings containing commas must be quoted
+"Walter ""The Danger"" White",52,1958-09-07,quotes are escaped by doubling them up
+Joe Smith,33,1990-06-02,"multi line strings
+span multiple lines
+there are no escape characters"
 ```
 
-## Delimiters for Rows and Columns
+The first row might be a header of field names or there might be no header and
+the first line is already data.
 
-Rows are typically separated by line breaks (`\n` or `\r\n`), while columns
- (fields) are separated by a specific delimiter. Although commas are the most
- common delimiter for fields, other characters, such as semicolons (`;`), are
- commonly used in regions where commas are decimal separators (e.g., Germany).
- Tabs (`\t`) are also used as delimiters in some cases, with such files often
- referred to as "TSV" (Tab-Separated Values).
+## Delimiters
 
-Example using semicolons as delimiter and comma for decimal separator:
+Rows are separated by line breaks (`\n` or `\r\n`), columns are separated by a comma.
+
+Tabs (`\t`) are sometimes used instead of commas and those files are called "TSVs"
+(Tab-Separated Values). They are easier to paste into Excel.
+
+Occasionally other characters can be used, for example semicolons (`;`) may be used
+in Europe because commas are [decimal separators](https://en.wikipedia.org/wiki/Decimal_separator)
+instead of the decimal point.
 
 ```csv
-Name; Age; Grade
-Alice; 30; 50,50
-Bob; 25; 45,75
-Charlie; 35; 60,00
+Name;Age;Grade
+Alice;30;50,50
+Bob;25;45,75
+Charlie;35;60,00
 ```
 
 ## Data Types
 
 CSV files do not inherently define data types. Numbers and dates are stored as
- plain text, and their interpretation depends on the software importing the
- file. Typically, data is interpreted as follows:
+text. Interpreting and parsing them is left up to software using them.
+Typically, data is interpreted as follows:
 
 ```csv
-Data, Comment
-100, Interpreted as a number (integer)
-100.00, Interpreted as a number (floating-point)
-2024-12-03, Interpreted as a date or a string (depending on the parser)
-Hello World, Interpreted as text (string)
-"1234", Interpreted as text instead of a number
+Data,Comment
+100,Interpreted as a number (integer)
+100.00,Interpreted as a number (floating-point)
+2024-12-03,Interpreted as a date or a string (depending on the parser)
+Hello World,Interpreted as text (string)
+"1234",Interpreted as text instead of a number
 ```
 
-## Quoting Strings and Special Characters
+## Further reading
 
-Quoting strings is only required if the string contains the delimiter, special
- characters, or otherwise could be interpreted as a number. However, it is
- often considered good practice to quote all strings to enhance readability and
- robustness.
-
-```csv
-Quoting strings examples,
-Unquoted string,
-"Optionally quoted string (good practice)",
-"If it contains the delimiter, it needs to be quoted",
-"Also, if it contains special characters like \n newlines or \t tabs",
-"The quoting "" character itself typically is escaped by doubling the quote ("")",
-"or in some systems with a backslash \" (like other escapes)",
-```
-
-However, make sure that for one document, the quoting method is consistent.
- For example, the last two examples of quoting with either "" or \" would
- not be consistent and could cause problems.
-
-## Encoding
-
-Different encodings are used. Most modern CSV files use UTF-8 encoding, but
- older systems might use others like ASCII or ISO-8859.
-
-If the file is transferred or shared between different systems, it is a good
- practice to explicitly define the encoding used, to avoid issues with
- character misinterpretation.
-
-## More Resources
-
-+ [Wikipedia](https://en.wikipedia.org/wiki/Comma-separated_values)
-+ [RFC 4180](https://datatracker.ietf.org/doc/html/rfc4180)
+* [Wikipedia](https://en.wikipedia.org/wiki/Comma-separated_values)
+* [RFC 4180](https://datatracker.ietf.org/doc/html/rfc4180)
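
The quoting rules shown in the sample CSV added by this diff (commas inside quotes, doubled quotes, multi-line fields, empty fields) can be checked with a parser. The following is a minimal sketch, assuming Python's standard `csv` module and a trimmed copy of that sample; `SAMPLE` and `parse_rows` are illustrative names, not part of the patch.

```python
import csv
import io

# Trimmed copy of the sample from the added lines of the diff (assumption:
# a header row is present, so DictReader can use it for field names).
SAMPLE = '''Name,Age,DateOfBirth,Comment
Alice,30,1993-05-14,
Eve,,,data might be missing because it's just text
"Louis XIV, King of France",76,1638-09-05,strings containing commas must be quoted
"Walter ""The Danger"" White",52,1958-09-07,quotes are escaped by doubling them up
Joe Smith,33,1990-06-02,"multi line strings
span multiple lines
there are no escape characters"
'''


def parse_rows(text: str) -> list[dict[str, str]]:
    """Parse CSV text with a header row into a list of dicts (hypothetical helper)."""
    # newline="" leaves line breaks untouched so the csv module can handle
    # the ones that appear inside quoted fields.
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return list(reader)


rows = parse_rows(SAMPLE)
for row in rows:
    # Every value comes back as a string; no type inference is performed.
    print(repr(row["Name"]), repr(row["Age"]), repr(row["Comment"]))
```

Every field is returned as a string, including `Age`, which matches the point in the Data Types section that interpretation is left to the consuming software.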