Monday, September 2, 2013

I want to be a Data Scientist

I saw a really interesting YouTube video today called [The Data Scientists Toolset] [1]
[1]: "YouTube video"
It is a video of a panel discussion from a conference called Data Scientist Summit (note to self: I have really missed the boat if others were already holding summits… back in 2012! ;) )
I admit, I did not know anyone on the panel but after listening to them talk I believe they must all be experts of some note.
Some of the key points for me were:

  • The 3 big things you need as a Data Scientist
  • Value from Big Data = having Big Analytics
  • Run experiments ‘at scale’
  • Room for Everyone - Hadoop, NoSQL and “new SQL”, and
  • The ‘Desert Island challenge’

I’ll cover each in a bit more detail below.

The 3 big things you need as a Data Scientist

According to the experts on the panel, there are 3 big things that Data Scientists need to have:

  1. Domain skills and expertise,
  2. Great modelling (read statistics) skills, and
  3. Tech literacy with the Big data (and other) tools and technologies required.

To me, this is a great list of reasons for good collaboration between the Business and IT. Business professionals ideally have good Domain skills and experience. Vice versa, IT professionals typically have technology literacy.

The 'middle ground', great modelling (statistics) skills, is the interesting one. Some people have these skills from whatever they studied at Uni and carried into their professional career.

It's more likely that Business professionals are going to have the right skills and experience, especially in business domains such as Economics, Finance, Science, Research, etc.

However, of the 3 required areas of skills/experience, this is the one most likely to 'fall between the cracks', i.e. no-one has it.

I think this is (perhaps) the reason that higher education qualifications being offered by Universities around the world are so 'heavy' in statistics and maths.

Value from Big Data = having Big Analytics

I think this was a great point!

I see a lot of excitement (almost hysteria) about Big Data and how **cool** it is to be able to parse the Petabytes of log files and other Big Data out there but … where is the value?

Big Data is often associated with the 3 'qualifying' V's - Volume, Variety and Velocity.

I think it is a good idea to add 2 more 'quantifying' V's to the list - Veracity and Value.


Veracity to me examines the question of the 'validity' of the data source in terms of what people want to do with it.

One of the lessons I have learnt from just 'normal' Data warehouse and Business Intelligence/Analytics solutions is that just because there is data out there, it does not mean you should try to capture it and make use of it. You really need to ask yourself: is this data appropriate to my needs? Or, in a qualitative sense, how appropriate is the data (does it do part of the job)?


The 'flip side' is: is there value in using this data? Does it help me tell the right story?
Big Data to me has a huge risk to be addressed - the GIGO (Garbage In, Garbage Out) principle means that people risk *Big Garbage*.

Run experiments ‘at scale’

Gone are the days of having to have small amounts of data to test your 'models' and to validate that they produce the 'right' results before trying them out on the 'real data' (usually Production or a copy of Production).

The panel stressed that Big Data tools and technologies allow you to operate 'at scale'.

Personally, I'm not sure about this one. I may not be a gun at statistics but I seem to remember that it does not take much data to provide a statistically valid model e.g. to predict the outcome of an election all you really need is a relatively small, but representative, sample from the population to have confidence in the results predicted… assuming that the rest of the population follow certain rules.

I think there's a difference between validating your models and running on full scale data.
Just because Big Data has 'resources to burn' I don't think people should lose sight of good modelling and testing.
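To put rough numbers on the election example, here is a minimal Python sketch of the standard margin-of-error formula for a proportion. It assumes a simple random (i.e. representative) sample and a 95% confidence level; the function names are my own and are only for illustration.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion estimated from a simple random sample.

    z=1.96 corresponds to a 95% confidence level; proportion=0.5 is the worst case.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

def required_sample_size(margin, proportion=0.5, z=1.96):
    """Smallest sample size whose margin of error is no larger than `margin`."""
    return math.ceil((z ** 2) * proportion * (1 - proportion) / margin ** 2)

# How much does extra data buy you? Each 100x more data only narrows the margin 10x.
for n in (100, 1_000, 10_000, 1_000_000):
    print(f"n={n:>9,}: +/- {margin_of_error(n) * 100:.2f} percentage points")

print("Sample needed for +/- 3 points:", required_sample_size(0.03))
```

Under these assumptions, about a thousand respondents already gets you to roughly +/- 3 percentage points, and going to a million only tightens that to about a tenth of a point. The catch is the 'representative' part: none of this holds if the sample is biased, no matter how big it is.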

Room for everyone

I think it's 'reassuring' that Big Data is seen as a complementary technology and is best applied to suitable 'problems' (or classes of problem).

The panel made it clear they see a role for all of the data technologies: Big Data (e.g. Hadoop), NoSQL, and 'new SQL'.

One criterion they suggested for deciding which data technology was the best fit was whether the model of the data was 'to be discovered', partially agreed, or agreed (respectively).

Big Data technologies are typically associated with 'a model at use time' versus 'new SQL' where the modelling takes place first and then the data is poured in.
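The difference between 'modelling first' and 'a model at use time' is often described as schema-on-write versus schema-on-read. Here is a small Python sketch of the idea, not anything the panel showed: it uses SQLite as a stand-in for the SQL side and a list of raw JSON lines as a stand-in for a Big Data store. The table, field names and sample records are all made up for illustration.

```python
import json
import sqlite3

# Schema-on-write ("model first"): the structure is agreed before any data is poured in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id INTEGER NOT NULL, url TEXT NOT NULL)")
conn.execute("INSERT INTO page_views VALUES (?, ?)", (1, "/home"))
rows_by_url = conn.execute(
    "SELECT url, COUNT(*) FROM page_views GROUP BY url"
).fetchall()

# Schema-on-read ("model at use time"): raw records are kept exactly as they arrived,
# even though their fields are inconsistent.
raw_log = [
    '{"user": 1, "url": "/home"}',
    '{"user": 2, "url": "/home", "referrer": "search"}',
    '{"uid": 3, "path": "/about"}',
]

def read_page_view(line):
    """Apply a model to one raw record at the moment it is used."""
    record = json.loads(line)
    return {
        "user_id": record.get("user", record.get("uid")),
        "url": record.get("url", record.get("path")),
    }

views = [read_page_view(line) for line in raw_log]

print(rows_by_url)
print(views)
```

In the first half the database would reject anything that does not fit the agreed table. In the second half nothing is rejected on the way in, and the 'model' lives in `read_page_view`, which can be changed later as more is discovered about the data.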

The ‘Desert Island challenge’

When it was time to wrap up, the moderator for the expert panel session posed a question: If you were (to be) stranded on a desert island, what tool or technology would you take with you … and only one!
Interestingly, **all** of the panel members named a programming language technology: Java, C++, Python, etc.

I guess this speaks to the 'roots' of the panelists and the fact that the Big Data tools and technology, while all useful in their own right, are not quite there yet to be able to dislodge the versatility and power provided by a programming language.

I hope this 'commentary' was of interest. I would encourage you to view the YouTube video for yourself. I am sure you will get different stuff out of it than I did.

More on Big Data to come in future Blogs.

1 comment:

  1. Update: found this good preso on YouTube re the different types of 'database' technologies out there - related to the 'room for everyone' section of my Blog