A Marketer’s Guide To Big Data And Cloud Computing…

Intro

Big data and cloud computing, we should leave that to the engineers…right? Well, product marketers need a basic conceptual understanding of these topics in order to explain more effectively the value proposition of the many software products that use technology related to big data and cloud computing.

I have written this post to act as a study guide for cloud computing and big data for marketers.

The Value Of Data

Data is king now. I recently read an article claiming that data is now a more valuable commodity than oil.

Companies like Facebook, Twitter and Uber do not sell a physical product but make money by profiting from their users’ data. According to TechCrunch, Facebook was worth $506.2 billion as of July 2018. Clearly, big data is big business.

Companies like Facebook make money by recording how users interact with their apps and websites and then putting that data to various uses.

In the case of Facebook, it records how users interact with its app and then uses that data to provide marketing services to people who want to use Facebook for paid advertising. In effect, it sells access to your data.

I personally think that in 10 years or so, data may well be considered an asset by accountants and show up on company financial statements. This is relevant because Facebook has 2.32 billion users worldwide according to Zephoria, which means that Facebook is processing ungodly amounts of data. This also helps answer the question:

What is Big data?

First of all, what is data? According to Merriam-Webster, data is “factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation.” In short, data is factual information. 

According to Intellipat.com, big data is defined as “extremely large data sets that are so complex and unorganized that they defy the common and easy data management methods that were designed and used up until the extreme rise in data.”

Essentially, big data is data that is too large and unstructured to be stored, manipulated and analyzed with traditional databases and analytics tools.

In the case of Facebook, it can track how many times a user opens the app, logs into the website, how far down the page they scroll, what kinds of content they view, and so on. It can track all of this for all 2.32 billion of its users.

Does Big Data Refer To A Volume Or A Technology?

Is big data an amount of data, or is it a technology? The term implies that it is an amount of data, but that is not always true.

In the context of how companies use big data, the term typically refers to the technology used to analyze, manipulate and store it. When “Big Data” is used this way, it often describes a new approach to managing and analyzing data that goes beyond relational databases.

While relational databases are fine tools for many applications, the sheer scale of data generated by large public services (like Facebook) overwhelms the classic relational-database-with-SQL (Structured Query Language) approach.

The tools and techniques for managing and storing Big Data for analysis are a worthy topic for another blog post. Tools like the Cassandra key-value (NoSQL) datastore were developed at Facebook to address issues with very large amounts of data; Cassandra was released as an open-source project in 2008. Other examples of key-value stores include Accumulo, Oracle Berkeley DB, Redis and many others. A guide to these is available here.
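To make the key-value idea concrete, here is a minimal in-memory sketch in Python. A real store like Cassandra or Redis exposes this same put/get-by-key interface but distributes the data across many servers; the class and keys below are invented for illustration only.

```python
# Minimal sketch of the key-value idea behind stores like Cassandra or Redis:
# each record is saved and fetched by a unique key, with no fixed table schema.
# An in-memory dict stands in for a real distributed datastore.

class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Save a value under a key, overwriting any previous value."""
        self._data[key] = value

    def get(self, key, default=None):
        """Look up a value by its key."""
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:42:last_login", "2019-03-01T09:15:00Z")
store.put("user:42:scroll_depth", 0.8)
print(store.get("user:42:scroll_depth"))  # 0.8
```

The key design point is that lookups go by key alone, which is what lets these stores spread billions of records across machines without the joins and schemas of a relational database.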


Besides new approaches to databases for handling massive amounts of data, there have been breakthroughs in high-speed stream-processing software that help companies ingest data as it arrives.

One widely used tool for processing streams of data, such as log messages or posts arriving to be processed, is Kafka, an Apache Software Foundation project originally developed at LinkedIn and released as open-source software in 2011.

Kafka is widely adopted for ingesting and processing large amounts of data entering a data center on a constant basis. Kafka is typically used “in front of” Cassandra or Accumulo to initially process and then save vast quantities of data. More about this is explained here.
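The “stream processor in front of a datastore” pattern can be sketched at toy scale: messages arrive as a stream, a processor parses each one, and the result is saved in a store keyed by user. Everything below (the message format, the function name) is an invented illustration of the pattern, not real Kafka code.

```python
# Toy illustration of the stream-ingestion pattern described above.
# A real deployment would read from Kafka topics and write to a Cassandra
# cluster; here a list of log messages and a dict stand in for both.

def ingest(stream, datastore):
    """Parse each raw log message and save its event under the user's id."""
    for message in stream:
        user_id, event = message.split(",", 1)
        datastore.setdefault(user_id, []).append(event)

incoming = [
    "u1,opened_app",
    "u2,scrolled_feed",
    "u1,liked_photo",
]
datastore = {}
ingest(incoming, datastore)
print(datastore["u1"])  # ['opened_app', 'liked_photo']
```

At Facebook scale the same shape holds, except the stream never stops and the “dict” is a cluster of machines.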

The reader can dive into the links above to learn more about these tools and how such large volumes of data are processed. This article will now concentrate on what data is and how big “Big Data” can truly grow.

Measuring Data

The smallest unit of measurement is a bit; 8 bits make 1 byte, which is the fundamental unit of measurement for data.

A single bit can have a value of either 0 or 1. The 0 and 1 may correspond with a binary value such as off or on switches, or true or false statements.

At its most basic, this is how computers work. Computers basically operate using on or off switches or true or false statements.

A byte can store 2^8, or 256, different values, which can be used to represent standard ASCII characters, such as letters, numbers and symbols.

For example, the keys on a keyboard work by using binary code. An arrangement of 1s and 0s correspond with a certain key. Each key uses 1 byte of data. 

The keyboard first assigns an ASCII value to each key, then converts it to binary. For example, the lowercase “h” has an ASCII value of 104 and a binary value of 01101000. After converting the value to binary, the computer can process and store the data.
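This character-to-binary conversion is easy to see for yourself in Python, using the built-in `ord()` (character to ASCII value) and `format()` (value to binary string):

```python
# The letter-to-binary conversion described above, using Python built-ins.
# ord() returns a character's ASCII value; format(..., "08b") renders that
# value as an 8-bit binary string, i.e. exactly one byte.

value = ord("h")             # lowercase "h" has ASCII value 104
bits = format(value, "08b")  # zero-pad to 8 bits
print(value, bits)           # 104 01101000
```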

All marketers should know this basic picture of how computers represent data.

To the best of my knowledge, there is no minimum amount of data that qualifies as “big data,” but in general, big data is measured in petabytes and exabytes (1,000 petabytes make one exabyte).

More On Data Measurement

The chart below from TechTerms.com offers a nice visual explaining how data is measured. 

The next unit of measurement is a kilobyte, which is 1000 bytes. The value of each unit continues to grow exponentially, as you can see in the chart below.

It was not all that long ago (10-15 years) that most storage devices were measured in gigabytes, and terabyte-scale storage was very rare.

My father used to run a company that specialized in organizing and analyzing extremely large datasets for legal firms and government entities.

I remember that he once was hoping that his company would land a client that had datasets that were 2-3 terabytes in size. At the time (circa 2006-07), datasets that size were practically unheard of.

I went through all of this for historical context and to give you a basic understanding of how data works, but also to make the point that “big data” is hard to quantify. Also, what constitutes “big data” may change in the future.

As you can see in the chart, the largest value is the yottabyte, or 1000^8 bytes. For another reference, consumer cloud-storage plans for Mac users are typically measured in single-digit terabytes.

| Unit | Value | Size |
|------|-------|------|
| bit (b) | 0 or 1 | 1/8 of a byte |
| byte (B) | 8 bits | 1 byte |
| kilobyte (KB) | 1000^1 bytes | 1,000 bytes |
| megabyte (MB) | 1000^2 bytes | 1,000,000 bytes |
| gigabyte (GB) | 1000^3 bytes | 1,000,000,000 bytes |
| terabyte (TB) | 1000^4 bytes | 1,000,000,000,000 bytes |
| petabyte (PB) | 1000^5 bytes | 1,000,000,000,000,000 bytes |
| exabyte (EB) | 1000^6 bytes | 1,000,000,000,000,000,000 bytes |
| zettabyte (ZB) | 1000^7 bytes | 1,000,000,000,000,000,000,000 bytes |
| yottabyte (YB) | 1000^8 bytes | 1,000,000,000,000,000,000,000,000 bytes |
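The progression in the chart above can also be expressed in code: a short helper that converts a raw byte count into the largest convenient decimal (SI) unit. The function name is my own invention for the illustration.

```python
# The chart above in code: step through the decimal units (powers of 1000)
# until the number is small enough to read comfortably.

UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes):
    """Express a byte count using the units from the chart."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

print(human_readable(2_500_000_000_000))  # 2.5 TB
```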


Petabyte Storage Vendors

For further context on how big data can be used, some examples of companies in the big data storage space include:

Cloud Computing

The advent of the cloud has made storing extremely large datasets easier. The reader should understand that there is a distinction between cloud computing and the Internet itself.

The Internet and The Cloud are two distinct things. The Cloud would not be possible without the Internet, but the Internet stands alone as a resource for moving information between users and applications.

Internet – a managed set of routers, switches and optical cross-connection equipment that implements the IP routing protocol (in the case of the routers and switches) and the underlying data transmission (in the case of the optical network connecting everything), providing ubiquitous connectivity for everything sending and consuming data. This is not to ignore the wireless networks operated by Verizon, AT&T, etc.

These, of course, supply vast numbers of connections to servers over the Internet. There are many ways to access the Internet, but without it we could not have smartphones, laptops, servers or networked services (Facebook, LinkedIn, Google).

Cloud computing – a managed set of servers, storage and internal network connections (between the servers and storage) that allows applications to run on behalf of enterprises or individuals. Amazon Web Services (AWS) is a good example of such a Cloud service.

An AWS user can “spin up” servers that run somewhere in an Amazon data center without the user having to know or care what physical server is used to run an application or where its storage may be located. The details of having a server with the internal resources (storage, internal network connectivity to that storage) are managed by Amazon.

To the user of AWS the set of servers and storage is then just a “Cloud” of computing capacity that can be utilized more or less “on demand” for a monthly fee.

Cloud computing services such as Google, Amazon Web Services (AWS), Microsoft Office 365, Microsoft Azure (Microsoft’s computing cloud that competes with AWS), Facebook and LinkedIn maintain servers that are accessible over the Internet. In one way or another, these services are available for computing, communicating with others, posting resumes, viewing career connections and more.

Here are some more examples of cloud computing. 

In other words, the software and hardware that your company needs to function is provided by another company and accessed over the Internet. This is called Infrastructure as a Service (IaaS).

Companies like Amazon offer IaaS, a pay-as-you-go service that allows companies to store their data on Amazon’s servers and access it via the cloud through an API, almost always over the HTTPS protocol. HTTPS is used for accessing web services because it encrypts traffic between the user and the service. In the age of the cloud, security is a major issue (this will be a blog post all its own as well).
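Pay-as-you-go simply means cost scales with what you use. A quick back-of-the-envelope calculation shows the idea; the per-gigabyte rate below is invented for the example and is not an actual AWS price.

```python
# Hypothetical pay-as-you-go storage bill: cost = rate x storage x time.
# The rate is an assumed figure for illustration, not a real vendor price.

price_per_gb_month = 0.023  # assumed USD per GB per month
stored_gb = 500             # data kept in the provider's cloud
months = 3

cost = price_per_gb_month * stored_gb * months
print(f"${cost:.2f}")  # $34.50
```

The appeal for companies is that there is no up-front hardware purchase; the bill simply tracks usage month to month.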


APIs

APIs could be a blog post in and of themselves, as there are many different kinds of APIs. Marketers should have an understanding of APIs because they play an important role in software integration. Essentially, they help make different software programs speak to each other. I will not go into too much detail about APIs, but I will explain briefly and give an example.

Here is a good video analogizing an API to a waiter in a restaurant.

API stands for ‘application programming interface,’ and Google defines it as “a set of functions and procedures allowing the creation of applications that access the features or data of an operating system, application, or other service.”

In the waiter analogy, an API is like the waiter: it goes back and forth between the table and the kitchen. The waiter writes down what the customer would like to eat, takes the order back to the chefs, the chefs prepare it, and the waiter brings the food back to the table.

For example, when Facebook bought Instagram it did not have access to all of Instagram’s data (which Facebook needed to sell advertising). Instagram had a gold mine of valuable data such as how far down the Instagram feed people have scrolled, what content they viewed, what pictures they liked, etc.

Facebook uses APIs like a waiter to contact Instagram’s servers and extract that data. Essentially, Facebook’s APIs would say “ok, I want all of the data regarding how far down the newsfeed people have scrolled.”

Instagram’s “chefs” or engineers had already packaged up that data and stored it on servers. They then used a protocol that gave Facebook’s APIs permission to extract that data. In a nutshell, that is how APIs work.
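The waiter analogy can be sketched in a few lines of Python: the API function is the only path between the caller (the “customer”) and the service’s internal data (the “kitchen”). All names and data here are hypothetical, invented purely to illustrate the request/response shape.

```python
# The waiter analogy in code: callers never touch the "kitchen" directly;
# every request goes through the API function, which checks permission,
# finds the data, and hands back a response.

KITCHEN = {  # the service's internal data, hidden from callers
    "scroll_depth": {"u1": 0.8, "u2": 0.35},
    "likes": {"u1": 12, "u2": 4},
}

def api_get(resource, authorized=True):
    """The 'waiter': take a request, fetch from the kitchen, return a response."""
    if not authorized:
        return {"status": 403, "data": None}   # permission denied
    if resource not in KITCHEN:
        return {"status": 404, "data": None}   # no such dish on the menu
    return {"status": 200, "data": KITCHEN[resource]}

response = api_get("scroll_depth")
print(response["status"], response["data"]["u1"])  # 200 0.8
```

Real web APIs work the same way over HTTPS, with those numeric status codes (200, 403, 404) signaling success, denied permission, or a missing resource.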

Conclusion

This post should serve as a good introduction to cloud computing and big data. I will write an additional post at a later time to address more details of internal data center technologies, so stay tuned for that!