Importing Data from a CSV

What is a CSV?

There should be little dispute that the humble CSV is a long-time champion of office-friendly data formats. When it comes to the specifics, there are some variations between them, but the "Comma Separated Values" document is a hallmark of modern business. Sure, some are "Tab Delimited" or even "Semicolon Delimited" - but the general structure is consistent. A CSV has a header row describing the data "rows" that follow, creating columns of data. The CSV is the text version of a spreadsheet, a similarity that can be credited for its widespread adoption.

Where CSVs Fall Down

One thing tabular records have always struggled with is expressing relationships between data. Real-world data is full of relationships, and that's something we need to account for in most modern database structures. To get around this, we will import our data in "sets" that act as the different types of data in our content graph, allowing us to connect them through our expressive mutation API.

The Data Import Protocol

Anytime you take data from one shape and conform it to another shape, you are describing a multi-step process. The following protocol covers the steps we will take to get our data into GraphCMS: inspect, cleanse, transform, and import.

Data Inspection

The Concept

First, let's look at our data. The original file is included in the GitHub repository, but for now, let's take a look at a (heavily truncated) snapshot.

+----+-----------------+-----------+--------------+--------+---------+
| ID |      Name       |   Brand   | Store Number |  Lat   |   Lon   |
+----+-----------------+-----------+--------------+--------+---------+
|  1 | Plaza Hollywood | Starbucks | 34638-85784  | 22.34… | 114.20… |
|  6 | Exchange Square | Starbucks | 34601-20281  | 22.28… | 114.15… |
+----+-----------------+-----------+--------------+--------+---------+

The data is a list of many Starbucks (and sibling brand) locations throughout the world. The actual fields are:

  • Store ID
  • Name
  • Brand
  • Store Number
  • Phone Number
  • Ownership Type
  • Street Combined
  • Street 1
  • Street 2
  • Street 3
  • City
  • Country Subdivision
  • Country
  • Postal Code
  • Coordinates
  • Latitude
  • Longitude
  • Timezone
  • Current Timezone Offset
  • Olson Timezone
  • First Seen

Our target shape will only include a subset of those fields.

The target schema consists of two models and an enumeration:

  • Coffee Shop: Number, Name, Ownership, City, Country, Postcode, Phone Number, Location, Brand (Rel), Store ID, Olson Timezone
  • Brand: Name, Coffee Shop(s) (Rel)
  • Ownership Enum: Licensed, JointVenture, CompanyOwned, Franchise
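
To make that shape concrete, here is a rough SDL-style sketch of the models. GraphCMS schemas are built in the web UI, so treat this as illustrative only; the field names match the transform code later in this post.

enum Ownership {
  Licensed
  JointVenture
  CompanyOwned
  Franchise
}

type Brand {
  name: String
  coffeeShops: [CoffeeShop]
}

type CoffeeShop {
  number: String
  name: String
  ownership: Ownership
  city: String
  country: String
  postcode: String
  phoneNumber: String
  location: Location
  storeID: Int
  olsonTimezone: String
  brand: Brand
}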

There are a number of ways to inspect the data, from formal data analysis tools like Tableau to generalist tools like Google Sheets. Whatever method you use, it is helpful to explore the dataset to get a complete idea of what you are working with. In the code repo linked below, there are some helpful utilities to address this issue.

Data Cleansing

As with every text-based dataset, errors will abound: duplicate records, missing values in fields that should otherwise be required, and more. It's vital to validate your assumptions about the data; otherwise importing will feel more like wrestling with it, even on a good day.

Fortunately, our APIs do a good job of helping with this.

Data Transformation

Once your data is understood (inspected) and predictable (cleansed), it's time to transform the content into the shape our mutation API expects. Again, I've included helper code for this below, but as every project is unique, you'll need to adjust these for your use-case.

Import, Connect, Enjoy

The last step is to actually import the data. The primary concerns here involve intelligent batching, catching errors, and ensuring the endpoint is open to accept incoming data. Once your data is imported, you have well-structured content backed by an expressive API that is ready to be consumed on any platform in a number of declarative and expressive ways! Let's look at the code.

The Code

To help with the process above, we've created some simple scripts. They are not robust - they lack in-depth error checking and probably some code elegance - but you can view them as an import workspace to quick-start your importing needs.

To follow along you'll need to sign up for an account at GraphCMS, create a new project, and implement the schema as defined above. For more details about creating a project with GraphCMS, check our getting started guide.

You can find the complete project over on GitHub.

The project structure is as follows.

data/
src/
  index.js  # for compiling with babel
  app.js
  inspectData.js
  utils/
  transforms/
  queries/ # not used in this project
  mutations/

Data Inspection

The file inspectData.js uses a series of helpers that let us check for uniqueness on various keys across our data set, report duplicates, check the full character set a field needs to support, and more.

The helpers live in utils/inspect.js and have been abstracted for general-purpose usage, which should allow them to be used in any project.

/*
Simple helper to get the first key
of a passed in object.
*/
const firstKey = obj => Object.keys(obj)[0];

/*
Helper to reduce the data to an object
with a shape of a single key,
provided as the only parameter.
*/
export const pickKey = key => data =>
  data.map(obj => {
    const newObj = {};
    newObj[key] = obj[key];
    return newObj;
  });

/*
Returns a list of all the unique values.
*/
export const uniqueValues = async data => {
  return [...new Set(data.map(obj => obj[firstKey(obj)]))];
};

/*
Returns all the chars that need to be
supported by a field. Helpful for validating
if a string field is needed or if integer
is allowed.
*/
export const uniqueChars = async data => {
  return [
    ...new Set(
      data.reduce(
        (prev, curr) => [...prev, ...curr[firstKey(curr)].split('')],
        []
      )
    ),
  ].join();
};

/*
Checks to see if the provided keys have data.
*/
export const requiredKeys = (keys, identityKey) => async data => {
  return [
    ...new Set(
      data.reduce((prev, curr) => {
        for (let key of keys) {
          if (!curr[key].length) prev.push(curr[identityKey]);
        }
        return prev;
      }, [])
    ),
  ];
};

/*
Checks whether every value for a key is unique.
Returns false if any duplicates exist.
*/
export const uniqueKey = key => async data => {
  const uniqueEntries = [
    ...new Set(pickKey(key)(data).map(obj => obj[firstKey(obj)])),
  ];

  console.log(data.length);
  console.log(uniqueEntries.length);

  return data.length === uniqueEntries.length;
};

/*
Reports all duplicates with an optional diagnostic key
as input.
*/
export const findDuplicates = key => async (data, diagnosticKey) => {
  return data
    .sort((a, b) => {
      if (a[key] < b[key]) return -1;
      if (a[key] > b[key]) return 1;
      return 0;
    })
    .filter((el, index, arr) => {
      if (index === 0) return false;
      return el[key] === arr[index - 1][key];
    })
    .map(el => el[diagnosticKey || key]);
};

You can use them like the following.

This file is specific to the provided dataset; you'll need to work with your own keys.
import {
  pickKey,
  uniqueValues,
  uniqueChars,
  uniqueKey,
  requiredKeys,
  findDuplicates,
} from './utils/inspect';
import { readFile } from './utils/readCSV';

export const inspectData = async () => {
  const data = await readFile(
    './data/all_starbucks_locations_in_the_world.csv'
  );

  const collection = {};

  // Check for Brands
  const brands = pickKey('Brand');
  collection.brands = await uniqueValues(brands(data));

  // Get List of Ownership Type Enum
  const ownershipType = pickKey('Ownership Type');
  collection.ownershipEnum = await uniqueValues(ownershipType(data));

  // Get List of unique phone chars
  const phone = pickKey('Phone Number');
  collection.phoneChars = await uniqueChars(phone(data));

  // Check for missing keys
  const checkEmptyLocation = requiredKeys(['Latitude', 'Longitude'], 'Name');
  collection.missingLocationData = await checkEmptyLocation(data);

  // Check for uniqueness of names.
  const uniqueName = uniqueKey('Name');
  collection.uniqueNames = await uniqueName(data);

  // Data cleansing revealed I had duplicates, this helps find them.
  const findCommonName = findDuplicates('Name');
  const commonNames = await findCommonName(data);
  collection.commonNamesFlat = [...new Set(commonNames)];

  // Looking for Duplicate Lat/Long values
  const uniqueLatitude = uniqueKey('Latitude');
  collection.uniqueLatitudes = await uniqueLatitude(data);

  const uniqueLongitude = uniqueKey('Longitude');
  collection.uniqueLongitudes = await uniqueLongitude(data);

  console.log(collection);
};
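
One helper not shown here is readFile from utils/readCSV, which turns the CSV into an array of objects keyed by the header row. A minimal sketch of such a helper is below, assuming a simple comma separated file; the version in the repo may differ, and a dedicated CSV parsing library is a better choice when fields contain quoted commas.

import { promises as fs } from 'fs';

/*
Reads a CSV file and maps each row to an object
keyed by the column names in the header row.
*/
export const readFile = async path => {
  const text = await fs.readFile(path, 'utf8');
  const [headerLine, ...lines] = text.trim().split('\n');
  const headers = headerLine.split(',').map(header => header.trim());

  return lines.map(line => {
    const values = line.split(',');
    return headers.reduce((row, header, index) => {
      row[header] = values[index] ? values[index].trim() : '';
      return row;
    }, {});
  });
};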

Data Cleansing and Transformation

My transformation methods are in /transforms; I recommend using a similar pattern. The transforms will be applied in the next section.

This file is specific to the provided dataset; you'll need to work with your own values.
const ownership = {
  LS: 'Licensed',
  CO: 'CompanyOwned',
  JV: 'JointVenture',
  FR: 'Franchise',
};

const duplicates = [
  'CentrO Ground Floor',
  'Division del Norte',
  'Gouda Station',
  'Lemessos Enaerios',
  'Mabohai Shopping Mall',
  'Magnolia',
  'Plaza America',
  'SPA',
  'Starbucks',
  'مركز أوتاد',
];

export const transform = arr => {
  return arr.map(obj => {
    const data = {};
    data.status = 'PUBLISHED';
    data.number = obj['Store Number'];
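    // Names listed in the duplicates array get the Store ID appended to keep them unique.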
    data.name = duplicates.includes(obj.Name)
      ? `${obj.Name} ${obj['Store ID']}`
      : obj.Name;
    data.ownership = ownership[obj['Ownership Type']];
    data.city = obj.City;
    data.country = obj.Country;
    data.postcode = obj['Postal Code'];
    data.phoneNumber = obj['Phone Number'];
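    // Dividing by 1 coerces the CSV string values into numbers.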
    data.location = {
      latitude: obj.Latitude / 1,
      longitude: obj.Longitude / 1,
    };
    data.storeID = obj['Store ID'] / 1;
    data.olsonTimezone = obj['Olson Timezone'];

    return data;
  });
};

Import

The final step is to put the pieces together: transform the content and import it.

You'll first need to write a mutation and save it in the mutations folder. Mine looks like the following:

mutation UpdateCoffeeShop(
  $brand: String
  $data: [CoffeeShopCreateWithoutBrandInput!]
) {
  updateBrand(
    where: { name: $brand }
    data: { coffeeShops: { create: $data } }
  ) {
    id
  }
}

I've provided some helper utilities using Axios; you will need to provide your own Project API ID and create a secure token for importing the data. For more information on creating a token, check the docs on working with Permanent Auth Tokens.

Fill out the following values in your .env file.

MUTATION_ENABLED_PAT=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUz…
PROJECT_ID=ck2lzdpl52f…
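
The upload helper in utils/fetch.js is a thin wrapper around Axios. A minimal sketch is shown below; the endpoint URL format and the exact signature are assumptions, so match them to your project settings and to how the paginate helper calls the upload function.

import axios from 'axios';

// Assumed endpoint format - copy the exact URL from your GraphCMS project settings.
const endpoint = `https://api-euwest.graphcms.com/v1/${process.env.PROJECT_ID}/master`;

/*
Returns a function that uploads a single page of records as the
$data variable of the supplied mutation. nameKey is used to log
which records were in a failing batch.
*/
export const uploadMutation = (mutation, variables, nameKey) => async page => {
  const response = await axios.post(
    endpoint,
    { query: mutation, variables: { ...variables, data: page } },
    { headers: { Authorization: `Bearer ${process.env.MUTATION_ENABLED_PAT}` } }
  );

  if (response.data.errors) {
    console.error(response.data.errors, page.map(el => el[nameKey]));
  }

  return response.data;
};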

When you put all the parts together, you will end up with a file like this:

import { uploadMutation } from './utils/fetch.js';
import { readFile } from './utils/readCSV';
import { inspectData } from './inspectData';
import { paginate } from './utils/paginate';
import { transform } from './transforms/coffetype';

import mutation from './mutations/batchImportCoffeeShop.graphql';

const run = async () => {
  const data = await readFile(
    './data/all_starbucks_locations_in_the_world.csv'
  );

  const brand = 'Coffee House Holdings';
  const filteredData = data.filter(el => el.Brand === brand);
  const transformedData = transform(filteredData);

  paginate(transformedData, uploadMutation(mutation, { brand: brand }, 'name'));
};

console.log('Get Ready!');
// inspectData();
// run is async, so catch rejections on the returned promise
// rather than relying on a surrounding try/catch.
run().catch(e => console.log('Server Error!', e));

The pagination function implements a two-second delay and a default page size of 50 records. You can pass in a custom page size as a third parameter to the paginate function.
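
For reference, here is a minimal sketch of what utils/paginate.js could look like; the repo's version may differ.

const wait = ms => new Promise(resolve => setTimeout(resolve, ms));

/*
Splits the records into pages of pageSize, passes each page to the
supplied upload function, and waits two seconds between batches.
*/
export const paginate = async (data, upload, pageSize = 50) => {
  for (let i = 0; i < data.length; i += pageSize) {
    await upload(data.slice(i, i + pageSize));
    await wait(2000);
  }
};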

Enjoy

And that's the general flow! Once you've completed all those steps, you can begin expressively querying your data in meaningful ways! Enjoy!