Are you looking for a good way to move data from and to your app? You know, export and import? Do you wish for maximum compatibility? Nothing beats a good ol’ CSV.
Yeah, we all wish moving data around would be this glamorous.
Comma-delimited files, or CSV, are not a standard. In fact, CSV is more of an idea than anything rigorously defined. But the idea is very attractive, because it is so simple. If you have a collection of records (e.g. a collection of customers), just put each record (customer) on a line and separate each record attribute (e.g. phone or email) with a comma. And put the names of the attributes on the very first line.
What could be simpler? It is so straightforward, it is bulletproof. Well, not really. The reality is, there are many complications.
Let’s recap the most common pitfalls.
CSV files are text files, which is great for searching or just peeking into a file when no proper spreadsheet app is handy. But, as we know, there are a bazillion ways to turn text into bytes.
This article will not attempt to convince you why you should use UTF-8. Let’s just say there once were a bazillion ways; these days you use UTF-8 with a proper BOM (byte order mark). The BOM is a great hint for any app that the following text is, in fact, UTF-8, and the user does not have to fiddle with any settings.
Why do we spend so much virtual ink on this? Because, sadly, way too much software will not use UTF-8+BOM by default when writing text, nor will it assume UTF-8 when reading.
Node.js does not automatically add the BOM, even if you request UTF-8 encoding. Make sure you add it manually.
fs.writeFileSync('/temp/export.csv', '\ufeff' + csv_data);
// you can pass encoding, but node uses utf8 by default. The above is equivalent to this.
fs.writeFileSync('/temp/export.csv', '\ufeff' + csv_data, { encoding: "utf-8" });
// don't forget to strip the BOM!
const text_raw = fs.readFileSync(path, { encoding: "utf-8" }).replace(/^\uFEFF/, '');
You say comma separated, but which comma are we talking about? It can be either ; or , depending on your app, or even on its language settings. Excel will happily use one in the en-US region and the other in the sk region. Then there is the closely related TSV format, which is just CSV with the tab character \t used instead of a comma.
Nope, there are only workarounds. Excel supports declaring the separator on the very first line.
sep=;
name;email;address;
Karol;...
Of course, support for this is spotty; macOS Numbers does not seem to like it.
Since a semicolon is less likely to appear in text, we suggest sticking with it. But it is best to let the user choose, and you can also try to “sniff”: read the first line (which is usually the header) and check whether it contains ; or \t or , .
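The sniffing idea can be sketched in a few lines. This is a hypothetical helper (sniffSeparator is our own name, not a library API): count each candidate separator in the header line and pick the most frequent one.

```javascript
// Naive separator sniffing: count each candidate in the header line
// and pick the one that occurs most often. Defaults to ";".
// Assumes the header line is representative of the rest of the file.
const sniffSeparator = (headerLine) => {
  const candidates = [";", "\t", ","];
  let best = ";";
  let bestCount = 0;
  for (const sep of candidates) {
    const count = headerLine.split(sep).length - 1;
    if (count > bestCount) {
      best = sep;
      bestCount = count;
    }
  }
  return best;
};
```

Note that quoted values containing separators can fool this, so still let the user override the guess.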
No, not from Alcatraz. Escaping in CSV means substituting illegal characters in the final text. Which characters are illegal? Well, at the very least the separator :).
The solution is to put the value (e.g. the customer’s address) into double quotes. OK, but what if there is a quote in the value? We double the quotes, of ehm course.
If the record value is csv "Format"?, it has to be written as "csv ""Format""?".
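The two rules (quote the value, double the embedded quotes) fit in a few lines. A minimal sketch, with escapeCsvValue as our own assumed name:

```javascript
// Quote a value if it contains the separator, a quote, or a new line,
// doubling any embedded quotes. Minimal sketch; new lines are kept
// as-is inside the quotes here (see the new-line discussion below).
const escapeCsvValue = (value, sep = ";") => {
  if (value.includes(sep) || value.includes('"') || value.includes("\n")) {
    return '"' + value.replaceAll('"', '""') + '"';
  }
  return value;
};
```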
At this point I hope you see why I was hesitant to call CSV a format. It is a collection of hacks made to fix a problem and move on.
You can also find apps that escape other characters, but you cannot rely on it :( The biggest and ugliest problem is caused by unescaped new lines. The sample below is perfectly valid CSV… unfortunately.
name;address
Carol;"Senec
Slovakia
Europe"
John Connor;"New York
United States"
Yes, a plain new line inside quotes is perfectly a-OK.
Now, be a good citizen and don’t do that. Instead, use the classic C-language escape: turn the new line into a backslash and the character n (\n).
Here is a quick and dirty method that will help with this situation. It will not scale beyond a few tens of megabytes, so beware.
const fixNewLines = (input) => {
  let output = "";
  let quoted = false;
  for (let i = 0; i < input.length; i++) {
    const c = input[i];
    if (c === "\r") {
      // a lone \r is a line break; for \r\n the following \n is handled below
      if (input[i + 1] !== "\n") {
        output += quoted ? "\\n" : "\n";
      }
      continue;
    }
    if (quoted && c === "\"") {
      if (input[i + 1] === "\"") {
        // doubled (escaped) quote, keep both characters
        output += "\"\"";
        i++;
        continue;
      }
      quoted = false;
    } else if (!quoted && c === "\"") {
      quoted = true;
    } else if (quoted && c === "\n") {
      // the problematic case: a raw new line inside quotes becomes \n
      output += "\\n";
      continue;
    }
    output += c;
  }
  return output;
}
A proper solution is a class with a readline method that reads data from the file until it finds a new line that is not within quotes.
Or just use a library ;) Then again, check the docs and test whether this situation is handled.
What about them? Thankfully, the world agreed on using Arabic numerals. Unfortunately, we have not agreed on the decimal separator. So in Slovakia 19,990 dollars is roughly 20 bucks, but it is 20 thousand in the US.
So is it best to use the decimal dot? Well, sadly, Excel on Slovak computers might reject 19.990 as a number, expecting the decimal comma. Again, there are no solutions, only workarounds. Either use the regional format or let the user decide.
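You can see the two conventions directly in the Intl API. Here de-DE stands in as an example of a comma-decimal locale:

```javascript
// The same number rendered under two locale conventions.
const n = 19990.5;
const us = new Intl.NumberFormat("en-US").format(n); // "19,990.5"
const de = new Intl.NumberFormat("de-DE").format(n); // "19.990,5"
```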
If you need to round numbers to a few decimal places, make sure to double-check the floating-point error handling.
In JavaScript, the plain toFixed(2) is not always accurate!
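A quick demonstration of the problem: 1.005 cannot be represented exactly in binary floating point, so it is actually stored as a value just below 1.005 and rounds down.

```javascript
// 1.005 is stored as 1.00499999999999989..., so toFixed rounds it down.
(1.005).toFixed(2); // "1.00", not the "1.01" you might expect
```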
Thankfully, we can use the Intl object. It will correctly round and format the number!
export const formatNumericValue = (precision: any, value: any, locale?: string) => {
  let str = value === undefined || value === null ? "" : value as string;
  const decimalPlaces = +(precision || "0");
  if (str !== "") {
    locale = locale || Intl.NumberFormat().resolvedOptions().locale;
    const fmt = new Intl.NumberFormat(locale, {
      style: "decimal",
      minimumFractionDigits: decimalPlaces,
      maximumFractionDigits: decimalPlaces
    });
    str = fmt.format(parseFloat(str));
  }
  return str;
}
There are infinite ways to write a date in Excel. No, we will not use the date formatting feature, and we will use the character o instead of 0. Not only do we use spaces, dots, and slashes for separating years, months, and days; we also write them in arbitrary order, often changing every couple of lines. Within the same file, for crying out loud. And, of course, we love to write years shortened to two digits.
The “proper” fix is to use the international date standard, ISO 8601. Which nobody will use to write dates by hand, of course. And Excel will not recognize it as a date value! I'm sure that's not surprising by this point.
Here is a little Excel one liner for ISO 8601 dates.
=DATEVALUE(MID(A1,1,10))+TIMEVALUE(MID(A1,12,8))
You can find even more details here.
So what should you use? It depends. If you are writing CSV files to be processed by other tools, I would suggest ISO dates. If they are for users directly, use the user’s regional/locale settings.
If you venture away from the ISO format, you’ll have to choose the locale and often the timezone as well. It is super frustrating. If you have to read arbitrary date values, make sure you tell the user exactly which date value (with the line number) was incorrect and why!
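As a sketch of that error reporting (parseIsoDateStrict is a hypothetical helper name, and the check is deliberately strict): validate the shape first, then the value, and include the line number in the message.

```javascript
// Reject anything that is not an ISO-8601-looking date,
// reporting the offending line number to the user.
const parseIsoDateStrict = (s, lineNo) => {
  const d = new Date(s);
  if (!/^\d{4}-\d{2}-\d{2}/.test(s) || isNaN(d.getTime())) {
    throw new Error(`Line ${lineNo}: "${s}" is not a valid ISO 8601 date`);
  }
  return d;
};
```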
Beyond date formatting, there is also the valid date range issue. Two-digit years can (for a few more years) be extended pretty easily. But a lot of software will not work with dates too far in the past or future! Microsoft SQL Server will refuse dates beyond the year 9999 and before the year 1.
If you stick a 0 for year into your db, it will be unhappy.
Or, if you prefix 20 to years shorter than 4 digits and you stumble upon a 3-digit year…
For example, JS will handle dates in the range April 20, 271821 BC to September 13, 275760 AD; thus it will happily parse or construct a date that will blow up down the line. Par for the course, I guess.
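The range limits are easy to check: JS dates are capped at ±8,640,000,000,000,000 ms from the Unix epoch, and anything beyond that silently becomes an Invalid Date.

```javascript
// The maximum representable JS date is exactly 8.64e15 ms from the epoch.
const max = new Date(8640000000000000);    // September 13, 275760 AD
const tooFar = new Date(8640000000000001); // Invalid Date
```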
Read a CSV file and return a list of JSON objects.
const readCsvFile = (path, delim = ';') => {
  const fixNewLines = (input) => {
    let output = "";
    let quoted = false;
    for (let i = 0; i < input.length; i++) {
      const c = input[i];
      if (c === "\r") {
        // a lone \r is a line break; for \r\n the following \n is handled below
        if (input[i + 1] !== "\n") {
          output += quoted ? "\\n" : "\n";
        }
        continue;
      }
      if (quoted && c === "\"") {
        if (input[i + 1] === "\"") {
          // doubled (escaped) quote, keep both characters
          output += "\"\"";
          i++;
          continue;
        }
        quoted = false;
      } else if (!quoted && c === "\"") {
        quoted = true;
      } else if (quoted && c === "\n") {
        // a raw new line inside quotes becomes \n
        output += "\\n";
        continue;
      }
      output += c;
    }
    return output;
  }
  const parseLine = (s, sep) => {
    const parts = [];
    let quoted = false;
    let acc = "";
    for (let i = 0; i < s.length; i++) {
      const c = s[i];
      if (!quoted && c === sep) {
        parts.push(acc);
        acc = "";
        continue;
      }
      if (c === "\"") {
        if (!quoted) {
          quoted = true;
          continue;
        } else {
          if (s[i + 1] === "\"")
            i++; // doubled quote, fall through and emit a single quote
          else {
            quoted = false;
            continue;
          }
        }
      }
      if (quoted && c === "\\") {
        if (s[i + 1] === "n") {
          i++;
          acc += "\n";
          continue;
        }
        if (s[i + 1] === "\\") {
          i++;
          acc += "\\";
          continue;
        }
      }
      acc += c;
    }
    parts.push(acc);
    return parts;
  }
  const text_raw = fs.readFileSync(path, { encoding: "utf-8" }).replace(/^\uFEFF/, '');
  const text = fixNewLines(text_raw);
  const lines = text.split("\n");
  // honor Excel's "sep=" declaration if present
  const sepLine = lines[0];
  if (sepLine.startsWith("sep=")) {
    lines.splice(0, 1);
    delim = sepLine[4];
  }
  const hdr = parseLine(lines[0], delim);
  const results = [];
  for (let i = 1; i < lines.length; i++) {
    const line = lines[i];
    const p = parseLine(line, delim);
    const o = {};
    let empty = true;
    for (let j = 0; j < hdr.length; j++) {
      const value = (p[j] || "").trim();
      if (value) {
        o[hdr[j]] = value;
        empty = false;
      }
    }
    if (!empty)
      results.push(o);
  }
  return results;
}
Write a CSV file given an array of JSON objects. It can handle different object shapes, i.e. objects with different sets of properties.
const writeCsvFile = (path, objs) => {
  // collect the union of all property names for the header
  const hdr = {};
  for (const obj of objs) {
    for (const key of Object.keys(obj))
      hdr[key] = 1;
  }
  const lines = [];
  const hdrs = Object.keys(hdr);
  // sort the hdrs if you'd like
  lines.push(hdrs.join(';'));
  for (const o of objs) {
    let arr = [];
    for (const key of hdrs) {
      let value = o[key];
      if (value === null || value === undefined) {
        value = "";
      } else {
        if (value instanceof Date) {
          value = value.toISOString();
        } else if (typeof value === "number") {
          value = formatNumericValue(2, value);
        } else {
          value = "" + value;
        }
        if (value.indexOf(";") >= 0 || value.indexOf("\"") >= 0 || value.indexOf("\n") >= 0) {
          value = '"' + value.replaceAll("\"", "\"\"").replaceAll("\\", "\\\\").replaceAll("\n", "\\n") + '"';
        }
      }
      arr.push(value);
    }
    lines.push(arr.join(';'));
  }
  fs.writeFileSync(path, '\ufeff' + lines.join('\n'));
}
Bonus: XML to CSV in Python. Some old-school web services still did not get the memo and stick to XML.
Here is a quick and dirty way to turn an XML file into something you can check with Excel (or send to a customer :).
def writeCsvFile(filePath, products):
    columnIndexCounter = 0
    columnIndex = dict()
    # assign a column index to every key seen in any product
    for product in products:
        for key in product:
            if columnIndex.get(key) == None:
                columnIndex[key] = columnIndexCounter
                columnIndexCounter += 1
    sb = ""
    hdr = []
    for x in range(0, columnIndexCounter):
        hdr.append('')
    for key in columnIndex:
        hdr[columnIndex[key]] = key
    sb += ";".join(hdr) + "\n"
    for product in products:
        outstr = []
        for x in range(0, columnIndexCounter):
            outstr.append('')
        for key in product:
            value = product[key]
            if value == None:
                value = ""
            if type(value) is list:
                # flatten lists by joining the items with tabs
                joined = ""
                for v in value:
                    joined += v + "\t"
                value = joined
            value = str(value).replace('"', '""')
            line_index = columnIndex[key]
            outstr[line_index] = '"' + value + '"'
        sb += ";".join(outstr) + "\n"
    with open(filePath, 'w', encoding='UTF8', newline='') as f:
        f.write('\ufeff')
        f.write(sb)
import ssl
import xml.etree.ElementTree as ET
from urllib import request

def dumpXmlToCsv(filePath, url):
    ssl._create_default_https_context = ssl._create_unverified_context
    xmlfile = request.urlopen(url).read().decode("utf-8")
    root = ET.fromstring(xmlfile)

    def getTag(p):
        tag = p.tag
        # ignore namespace
        if '}' in tag:
            tag = tag.split('}', 1)[1]
        return tag

    def getText(p):
        nodeText = p.text or ""
        # "handle" nested tags, somehow
        if len(p):
            nodeText = "["
            for child in p:
                nodeText += getTag(child) + "=" + getText(child)
                nodeText += ","
            nodeText += "]"
        return nodeText

    products = []
    for item in root:
        v = {}
        for p in item:
            tag = getTag(p)
            nodeText = getText(p)
            v[tag] = nodeText
        products.append(v)
    writeCsvFile(filePath, products)
Sure, you can say “I will use library X, so I don’t care.” For better or worse, there are too many moving parts, and even when using a library you will have to set up configuration options, which means you have to understand the situations in which they apply.
Or you might be dumping your data to CSV and feel a library is overkill, as it is just a couple of strings to join. Hopefully the above will help you choose some good defaults, so that whoever needs to process your files does not lose valuable hair-estate.
Whatever the case, I hope you enjoyed the article.
Happy hacking!