What is Snowflake?

Snowflake is a new-era relational SQL data warehouse built for the cloud. It aims to provide seamless, fully elastic access to business-critical data that satisfies everyone from analysts to IT to finance.

But why – aren’t there enough Data Warehouses already?!

tl;dr Quantity != Quality.

Snowflake offers decoupled, elastic compute and storage that is flexible enough to handle semi-structured data types such as JSON, yet rigid enough to provide strong performance (see benchmark) and consistency for real-time reporting.
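
To make the semi-structured point concrete, here is a minimal sketch of storing raw JSON in a VARIANT column and querying into it with Snowflake's path notation (the EVENTS table and its payload are hypothetical):

CREATE OR REPLACE TABLE EVENTS (PAYLOAD VARIANT); // VARIANT holds semi-structured data

INSERT INTO EVENTS
  SELECT PARSE_JSON('{"user": {"id": 42, "city": "NYC"}, "amount": 9.99}');

SELECT PAYLOAD:user.id::NUMBER   AS USER_ID,   // path notation reaches into the JSON
       PAYLOAD:user.city::STRING AS CITY,
       PAYLOAD:amount::DOUBLE    AS AMOUNT
FROM EVENTS;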

A Database is a logical container storing references to schemas and tables

A Database can have multiple Schemas, which in turn can contain multiple Tables – Captain Obvious for anyone familiar with SQL

Schemas are logical abstractions that contain tables

Tables contain metadata describing columns, data types and other details about how the actual data is stored

Views are abstractions of arbitrary SQL queries

Read more about Databases, Schemas, Tables and Views here
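
A minimal sketch tying those definitions together (the DEV database matches the one created later in this guide; the schema, table and view names are made up):

CREATE DATABASE IF NOT EXISTS DEV;         // logical container
CREATE SCHEMA IF NOT EXISTS DEV.ANALYTICS; // namespace within the database

CREATE OR REPLACE TABLE DEV.ANALYTICS.TRIPS ( // columns and types are the table's metadata
  TRIP_ID NUMBER,
  FARE    DOUBLE
);

CREATE OR REPLACE VIEW DEV.ANALYTICS.EXPENSIVE_TRIPS AS // a named, reusable SQL query
  SELECT TRIP_ID, FARE FROM DEV.ANALYTICS.TRIPS WHERE FARE > 50;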

A view on the Snowflake sample weather table
View metadata in the worksheet tab indicating the view's output column names and types

Sequences generate unique numbers across sessions and statements
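
For example, a hypothetical sequence used as a surrogate-key generator:

CREATE OR REPLACE SEQUENCE TRIP_SEQ START = 1 INCREMENT = 1;

SELECT TRIP_SEQ.NEXTVAL; // each call returns a new unique number, even across sessions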

Stages are locations where files are stored for loading into tables

File Formats describe how data in supported file formats should be parsed when loading it into tables
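
The SQL for stages and file formats is short; here is a sketch using the SFSTAGE and BASE_CSV names that appear later in this guide (the CSV options are assumptions about the source file):

CREATE OR REPLACE STAGE SFSTAGE; // internal named stage; Snowflake manages the storage

CREATE OR REPLACE FILE FORMAT BASE_CSV
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1; // assumes a single header row in the source CSV

LIST @SFSTAGE; // show the files currently sitting in the stage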

There are 4 default Roles: 

  1. ACCOUNTADMIN – can manage everything (e.g. warehouses, databases, billing, users, etc) in the account
  2. SYSADMIN – can manage everything except account management (e.g. users & billing)
  3. SECURITYADMIN – can manage users and all facets of security (e.g. roles, policies, etc) including configuring existing warehouses, but cannot create warehouses
  4. PUBLIC – can access warehouses, write queries, create tables, etc
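
These defaults are just the starting point; custom roles are created and granted with plain SQL. A hedged sketch (the ANALYST role and JANE user are hypothetical names):

USE ROLE SECURITYADMIN;            // user and role management lives here
CREATE ROLE IF NOT EXISTS ANALYST;
GRANT ROLE ANALYST TO USER JANE;

USE ROLE SYSADMIN;                 // object grants come from the object owner
GRANT USAGE ON DATABASE DEV TO ROLE ANALYST;
GRANT USAGE ON SCHEMA DEV.PUBLIC TO ROLE ANALYST;
GRANT SELECT ON ALL TABLES IN SCHEMA DEV.PUBLIC TO ROLE ANALYST;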

Ok, I’m sold – how can I get started?

Snowflake extends the typical SQL paradigm further than you might expect: you can do practically everything with SQL, from administrative tasks (e.g. user, database, and warehouse management) to traditional query workloads (BI, analytics and engineering). The Snowflake documentation already has a Snowflake in 20 minutes guide to getting started, which uses the snowsql command-line client for the entire process. My guide below is visual (with screenshots) and shows how to do similar functionality using the web interface; rough SQL equivalents of the steps appear after the list.

  • Create a warehouse
Create a “virtual warehouse” which acts as the compute layer
  • Create a database
Create a database, which contains references to schemas, tables, file formats and more
  • Open the database
Cracking open the database to look at options
  • Create a stage
Snowflake, S3 or Azure can manage the stage storage location
Name the stage and choose the schema
  • Create a file format for CSV (other options are JSON, XML, Avro, ORC and Parquet)
Creating a CSV file format in the DEV database
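
If you prefer typing to clicking, the warehouse and database steps map to roughly the SQL below (the warehouse name, size and suspend settings here are assumptions, not UI defaults); the stage and file format steps map to the CREATE STAGE and CREATE FILE FORMAT statements sketched earlier:

CREATE WAREHOUSE IF NOT EXISTS SF_TUTS_WH
  WITH WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 300  // seconds of inactivity before compute suspends (and billing stops)
  AUTO_RESUME = TRUE; // wake up automatically when the next query arrives

CREATE DATABASE IF NOT EXISTS DEV;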

Loading a file from your local computer into Snowflake can be done in a number of ways, which are reviewed here

We’re going to use the UI to load a file since this is a visual getting started guide.

  • Create a table (in a new worksheet)
New worksheet
CREATE OR REPLACE TABLE "DEV"."PUBLIC"."GREEN_TRIPS" (
  "VENDORID" NUMBER, 
  "LPEP_PICKUP_DATETIME" TIMESTAMP, 
  "LPEP_DROPOFF_DATETIME" TIMESTAMP, 
  "STORE_AND_FWD_FLAG" STRING, 
  "RATECODEID" NUMBER,
  "PULocationID" NUMBER,
  "DOLocationID" NUMBER,
  "passenger_count" NUMBER,
  "trip_distance" DOUBLE,
  "fare_amount" DOUBLE,
  "extra" DOUBLE,
  "mta_tax" DOUBLE,
  "tip_amount" DOUBLE,
  "tolls_amount" DOUBLE,
  "ehail_fee" STRING,
  "improvement_surcharge" DOUBLE,
  "total_amount" DOUBLE,
  "payment_type" NUMBER,
  "trip_type" NUMBER
) COMMENT = 'NYC TLC Green Trip Data';

Hit Run

Success!
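
Before loading anything, it is worth a quick sanity check that the table exists with the expected columns:

DESC TABLE "DEV"."PUBLIC"."GREEN_TRIPS";                 // list column names and types
SHOW TABLES LIKE 'GREEN_TRIPS' IN SCHEMA "DEV"."PUBLIC"; // confirm the table exists
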
  • Load file in via UI

Download a sample data file from the NYC Taxi and Limousine Commission (TLC) here: Green trip data

This is what the equivalent SQL looks like if you run a version of these commands yourself (such as in the snowsql command-line client):

PUT file:///Users/garren/Downloads/green_tripdata_2017-06.csv @SFSTAGE; // This uses the SFSTAGE named internal stage

COPY INTO "DEV"."PUBLIC"."GREEN_TRIPS"
  FROM @SFSTAGE/green_tripdata_2017-06.csv
  FILE_FORMAT = '"DEV"."PUBLIC"."BASE_CSV"'
  ON_ERROR = 'ABORT_STATEMENT'
  PURGE = TRUE;
SELECT COUNT(*)
FROM GREEN_TRIPS;
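
And since the whole point is consuming the data, here is a first hypothetical analytical query; note the quoted lower-case column names, which must match the CREATE TABLE statement above exactly:

SELECT "payment_type",
       AVG("tip_amount") AS AVG_TIP, // quoted identifiers are case-sensitive in Snowflake
       COUNT(*)          AS TRIPS
FROM GREEN_TRIPS
GROUP BY "payment_type"
ORDER BY AVG_TIP DESC;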

Congrats! You've now set up your Snowflake environment, created a table, loaded data into it and begun consuming it!

Disclaimer

I have no affiliation with Snowflake and was not asked to write this guide – I'm just a fan of technology solving real problems with great potential.

