
How to Buy SAS Visual Analytics


Stories about SAS Visual Analytics are among the most widely read posts on this blog.  In the last two years I’ve received many queries from readers who complain that it’s hard to get clear answers about the software from SAS.

In software procurement, the customer has bargaining power until the deal closes; after that, power shifts to the vendor.   In this post, I’ve compiled some key questions prospective customers should resolve before signing a license agreement with SAS.

SAS Visual Analytics (VA), first launched in 2012, is now in its seventh dot release.  With a total of ~3,400 sites licensed, the most serious early release issues are resolved.  The product itself has improved.  In early releases, for example, it was impossible to join tables after loading them into VA; now you can.  SAS has gradually added features to the product, and will continue to do so.

Privately, SAS account executives describe VA as a “Tableau-Killer”; a more apt description is “Tableau for SAS Lovers.”   An experienced Tableau user will immediately notice features missing from VA.  On the other hand, SAS offers some statistical features (SAS Visual Statistics) not currently available in Tableau, for an extra license fee.

As this chart shows, Tableau is still alive:

[Chart: SAS VA vs. Tableau revenue]

Source: Tableau Annual Report; SAS Revenue Press Release

SAS positions VA to its existing BI customers as a replacement product, and not a moment too soon; Gartner reports that organizations are rapidly pulling the plug on the legacy SAS BI product.  SAS prices VA to sell, clearly seeking to underprice Tableau and build a footprint.  Ordinarily, SAS pricing is a closely held secret, but SAS discloses its low VA pricing in the latest Gartner BI Magic Quadrant report.

Is VA the Right Solution?

VA works with SAS LASR Server, a proprietary in-memory analytic datastore, which should not be confused with in-memory databases like SAP HANA, Exasol or MemSQL.   In-memory databases have many features that are missing from LASR Server, such as ACID compliance, ANSI SQL engines and automated archiving.  Most in-memory databases can update data in real time; for LASR Server, you update a table by reloading it.  Commercial in-memory databases support many different end-user products for visualization and BI, so you aren’t locked in with a single vendor.  LASR Server supports SAS software only.

Like any other in-memory datastore, LASR Server is best for small high-value databases that will be queried by many users who require low latency.  LASR Server reads an entire table into memory and persists it there, so the amount of available memory is a limiting factor.

Since LASR Server is a distributed engine you can add more servers if you need more memory.  But keep in mind that while the cost of memory is declining, it is not free; it is still quite expensive per byte compared to disk storage.  In practice, most working in-memory databases support less than a terabyte of data.  By contrast, the smallest data warehouse appliances sold by vendors like IBM support thirty terabytes.

LASR Server’s principal selling point is speed.  The product is fast because it persists data in memory, and separates the disk I/O bottleneck from the user experience.  (You still need to load data into LASR Server, but you can do this separately, when the user isn’t waiting for a response.)

In contrast, Tableau uses a patented (i.e., proprietary) data engine that interfaces with your data source.  For extracts not already cached on the server, Tableau submits a query whose runtime depends on the data source; if the supporting database is poorly tuned, the query may take a long time to run.  In most cases, VA will be faster than Tableau, but it’s debatable how critical this is for a decision support application.

VA and LASR Server are the right solution for your business problem if all of the following conditions are true:

  • You work with less than a terabyte of data
  • You are willing to limit your visualization and BI tools to SAS software
  • You expect more than a handful of concurrent users
  • Your users require subsecond query response times

If you are thinking of using VA and LASR Server in distributed mode (implemented across more than one server), keep in mind that distributed computing is an order of magnitude more difficult to deliver.  Since SAS pitches a low-cost “Single Box Solution” as an entry-level product, most of those 3,400 customer sites run on a single server.  Before you commit to licensing the product in a multi-server configuration, you should insist on additional proof of product viability from SAS.  For example, insist on references from customers running in production in configurations at least as large as what you have in mind; and consider a full proof-of-concept (funded by SAS).

SAS’ low software pricing for VA makes it seem attractive.  However, you need to focus on the total cost of ownership, which we discuss below.

Infrastructure Costs

According to SAS’ sizing guidelines for VA, a single 16-CPU server with 256GB of RAM can support a 20GB table with seven heavy users.  (That’s 20 gigabytes of uncompressed data.)

For a rough estimate of the amount of hardware required:

  1. Determine the size of the largest table you plan to load
  2. Determine the total amount of data you plan to load
  3. Determine the planned number of “heavy” and “light” users.  SAS defines a heavy user as “any SAS Visual Analytics Explorer user or a user who runs correlational analysis with multiple variables, box plots with four or more measures, or crosstabs with four or more class variables.”  In practice, this means every user.

In Step #4, you write a large check to your preferred hardware vendor, unless you are working with tiny data.
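
Before writing that check, here is a rough back-of-the-envelope sketch in Python of Steps 1 through 3, assuming the ratios in SAS’ published guideline (one 16-CPU, 256GB node per 20GB table and seven heavy users) scale linearly; that linear scaling is my assumption for illustration, not a SAS formula.

    # Rough LASR Server node estimate, scaling linearly from SAS' guideline:
    # one 16-CPU, 256GB node per ~20GB uncompressed table and ~7 heavy users.
    # Linear scaling is an assumption for illustration, not a SAS formula.
    def estimate_nodes(total_data_gb, heavy_users,
                       data_per_node_gb=20, users_per_node=7):
        nodes_for_data = -(-total_data_gb // data_per_node_gb)  # ceiling division
        nodes_for_users = -(-heavy_users // users_per_node)
        return max(nodes_for_data, nodes_for_users, 1)

    print(estimate_nodes(total_data_gb=200, heavy_users=25))  # -> 10 nodes

Your hardware vendor, not this arithmetic, should produce the real sizing.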

SAS will tell you that VA runs on commodity servers.  That is technically true, but a little misleading.  SAS does not require you to buy your servers from any specific vendor; however, the specs needed for good performance are quite different from a typical Hadoop node server.  Not surprisingly, VA requires specially configured high-memory machines, such as these from HP.

[Image: HP server configurations for SAS VA]

Node servers are just the beginning of the story. According to an HP engineer with extensive VA experience, networking is a key bottleneck in implementations.  Before you sign a license agreement for VA, check with your preferred hardware vendor to determine how much experience they have with the product.  Ask them to provide a firm quote for all of the necessary hardware, and a firm schedule for delivery and installation.

Keep in mind that SAS does not actually recommend hardware for any of its software.  While SAS will work with you to estimate volume and workload, it passes this information to the hardware vendors you specify for the actual recommended sizing and configuration.  Your hardware vendor plays a key role in the success of your implementation of this product, so it’s important that you choose a vendor that has significant experience with this software.

Implementation

SAS publishes most of its documentation on its support website.  For VA, however, SAS keeps technical documentation for installation, configuration and administration under lock and key.  The implication is that it’s not pretty.  Before you sign a license agreement, you should insist that SAS provide the documentation for your team to review.

There is more to implementing this product than software installation.  Did you notice the fine print in SAS’ Hardware Sizing Guidelines?  I quote:

“These guidelines do not address the data management resources needed outside of SAS Visual Analytics.  Getting data into SAS Visual Analytics and performing other ETL functions are solely the responsibility of the user.”  

VA’s native capabilities for data cleansing and transformation have improved since the first release, but they are still rudimentary.  So unless your source data is perfectly clean and ready to use — ha ha — you’re going to need ETL processes to prepare your data.  Unless your prospective users are ETL experts, they will need someone to build those feeds; and unless you have SAS developers sitting on the bench, you’re going to need SAS or a SAS Partner to provide developers who can do the job.

If you are thinking about licensing VA, you are almost certainly using legacy SAS products already.  You may think that will make implementation easier, but think again: VA and LASR Server are fundamentally new products with a new architecture.  Your SAS users and developers will all need training.  Moreover, your existing SAS programs may need conversion to work with the new software.

Before you sign a license agreement for VA, insist on a firm, fixed price quote from SAS for all implementation tasks, including data feeds.  Your SAS Account Executive will tell you that SAS “does not do” fixed price quotes.  Nonsense.  SAS will happily give away consulting services if they can win your software business, so don’t take “no” for an answer.

SAS will need to do an assessment, of course, before fixing the price, which is fine as long as you don’t have to pay for it.

Time to Value

When SAS first released VA, implementations ran around three months under ideal circumstances.  Many ran much longer, due to unanticipated issues with networking and infrastructure.  With more experience, SAS has a better understanding of the product’s infrastructure requirements, and can set expectations accordingly.

Nevertheless, there is no reason for you to assume the risk of delay getting the product into production.  SAS charges you for a license to use the software from the moment you sign the contract; if the implementation project runs long, it’s on your dime.

You should insist on a firm contractual commitment from SAS to get the software up and running by a date certain, with financial penalties for failure to deliver.  It’s unlikely that SAS will agree to deferred payment of the first-year fee, or an acceptance deal, since this impacts revenue recognition.  But you should be able to negotiate an extended renewal anniversary based on the date of delivery and acceptance.  You can also negotiate deferred payment of the fixed price consulting fee.



Spark 1.4 Released


On June 11, the Spark team announced availability of Release 1.4.  More than 210 contributors from 70 organizations delivered more than 1,000 patches.  Spark continues to expand its contributor base, the best measure of health for an open source project.


Spark Core

The Spark team continues to improve Spark operability, performance and compatibility.  Key enhancements include:

  • The first phase in Project Tungsten performance improvements, a cache-friendly sort algorithm
  • Also for improved performance, serialized shuffle output
  • For the Spark UI, visualization for Spark DAGs and operational monitoring
  • A REST API for application information, such as job, stage, task and storage status (see the sketch after this list)
  • For Python users, support for Python 3.x, plus external spilling for Python groupByKey operations
  • Two YARN enhancements: support for YARN on EC2 and security for long-running YARN applications
  • Two Mesos enhancements: Docker support and cluster mode.
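
A sketch of the new REST API: it lists running applications, assuming a local application with the Spark UI on the default port 4040.

    # List applications via Spark's monitoring REST API (new in 1.4).
    # Assumes the Spark UI is running on the default port 4040.
    import json
    from urllib.request import urlopen

    with urlopen("http://localhost:4040/api/v1/applications") as resp:
        for app in json.load(resp):
            print(app["id"], app["name"])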

DataFrames and SQL

This release includes extensions of analytic functions for DataFrames, operational utilities for Spark SQL and support for ORCFile format.

A complete list of enhancements to the DataFrame API is here.
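
The analytic functions noted above are window functions, new to DataFrames in this release.  A minimal PySpark sketch, assuming an existing SparkContext sc and illustrative data:

    from pyspark.sql import SQLContext
    from pyspark.sql.window import Window
    from pyspark.sql import functions as F

    sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`
    df = sqlContext.createDataFrame(
        [("a", 1), ("a", 3), ("b", 2)], ["key", "value"])

    # Rank values within each key -- a window (analytic) function
    w = Window.partitionBy("key").orderBy("value")
    df.select("key", "value", F.rank().over(w).alias("rank")).show()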

R Interface

AMPLab released a developer version of SparkR in January 2014.  In June 2014, Alteryx and Databricks announced a partnership to lead development of this component.  In March 2015, SparkR officially merged into Spark.

SparkR offers an interface to use Apache Spark from R.  In Spark 1.4, SparkR supports operations like selection, filtering and aggregation on large datasets.  Note that as of this release, SparkR does not offer an interface to MLlib, Streaming or GraphX.

Machine Learning

In Spark 1.4, ML pipelines graduate from alpha release, adding feature transformers (VectorAssembler, StringIndexer, Bucketizer, etc.) and a Python API.  There appears to be an effort under way to rebuild MLlib’s supervised learning algorithms in ML; both ML and MLlib pick up additional enhancements, detailed in the release notes.

There is a single enhancement to GraphX in Spark 1.4, a personalized PageRank.  Spark’s graph analytics capabilities are comparatively static.

Streaming

The enhancements to Spark Streaming include improvements to the UI, enhanced support for Kafka and Kinesis, and a pluggable interface for write-ahead logs.  Enhanced Kafka support includes better error reporting, support for Kafka 0.8.2.1 and Kafka with Scala 2.11, input rate tracking, and a Python API for Kafka direct mode.
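
A minimal sketch of the new Python API for Kafka direct mode, assuming an existing SparkContext sc; the broker address and topic name are placeholders.

    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    ssc = StreamingContext(sc, batchDuration=10)
    # Direct (receiverless) stream; broker and topic are placeholders
    stream = KafkaUtils.createDirectStream(
        ssc, ["clicks"], {"metadata.broker.list": "broker1:9092"})
    stream.count().pprint()  # print the event count for each batch
    ssc.start()
    ssc.awaitTermination()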


O’Reilly Data Science Survey 2015


O’Reilly has released its 2015 Data Science Salary Survey.  The report, authored by John King and Roger Magoulas, summarizes results from an ongoing web survey.  The 2015 survey includes responses from “over 600” participants, down from the “over 800” tabulated in 2014.

The authors note that the survey includes self-selected respondents from the O’Reilly audience and may not generalize to the population of data scientists.  This does not invalidate the results; all surveys of data scientists, including Rexer and KDnuggets, use unscientific samples.  It does mean one should keep the survey audience in mind when interpreting results.

Moreover, since O’Reilly’s data collection methods are consistent from year to year, changes from 2014 may be significant.

The primary purpose of the survey is to collect data about data scientist salaries.  While some find that fascinating, I am more interested in what data scientists say about the tasks they perform and tools they use, and will focus this post on those topics.

Concerning data scientist tasks, the survey confirms what we already know: data scientists spend a lot of time in exploratory data analysis and data cleaning.  However, those who spend more time in meetings and those who spend more time presenting analysis earn more.  In other words, the real value drivers in data science are understanding the client’s business problem and explaining the results.  (This is also where many data science projects fail.)

The authors’ analysis of tool usage has improved significantly over the three iterations of the survey.  In the 2015 survey, for example, they analyze operating systems and analytic tools separately; knowing that someone says they use “Windows” for analysis tells us exactly nothing.

SQL, Excel and Python remain the most popular tools, while reported R usage declined from 2014.  The authors say that the change in R usage is “only marginally significant”, which tells me they need to brush up on statistics.  (In statistics, a finding either is or is not significant at the preselected significance level; this prevents fudging.)  The reported decline in R usage isn’t reflected in other surveys so it’s likely either (a) noise, or (b) an artifact of the sampling and data collection methods used.
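
To make the point concrete, here is the kind of test the authors should report: a two-proportion z-test at a preselected significance level.  The counts below are hypothetical, purely for illustration.

    # Two-proportion z-test at a preselected alpha.
    # The user counts and sample sizes are hypothetical.
    from statsmodels.stats.proportion import proportions_ztest

    r_users = [380, 270]  # hypothetical R users in 2014 and 2015
    samples = [800, 600]  # hypothetical sample sizes
    stat, p = proportions_ztest(r_users, samples)
    alpha = 0.05          # chosen before looking at the data
    print("significant" if p < alpha else "not significant")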

The 2015 survey shows a marked increase in reported use of Spark and Scala.  Within the Spark user community, the recent Databricks survey shows Python rapidly gaining on Scala as the preferred Spark interface.  Scala offers little in the way of native machine learning capability, so I doubt that the language has legs among data scientists.  On the other hand, respondents were much less likely to use Java, a finding mirrored in the Databricks survey.  Data scientists use Scala and Java to “roll their own” algorithms; but given the rapid growth of open source and commercial algorithms (and rapidly growing Python use), I expect that we will see less of that in the future.

Reported use of Mahout collapsed since the last survey.  As I’ve written elsewhere, you can stick a fork in Mahout — it’s done.  Respondents also said they were less likely to use Apache Hadoop; I guess folks have figured out that doing logistic regression in MapReduce is a loser.

Respondents also reported increased use of Tableau, which is not surprising.  It’s everywhere.

The authors report discovering nine clusters of respondents based on tool usage, shown below.  (In the 2014 survey, they found five clusters.)

[Chart: nine clusters of respondents by tool usage]

The clustering is interesting.  The top three clusters correspond roughly to a “Power Analyst” persona: a business user who can use tools for analysis but is not a hardcore developer.  The lower right quadrant corresponds to a developer persona: an individual with an engineering background able to work in hardcore programming languages.  Hive and BusinessObjects fall into a middle category; neither tool is accessible to most business users without significant commitment and training.

Some of the findings will satisfy Captain Obvious:

  • R and ggplot
  • SAP HANA and BusinessObjects
  • C and C++
  • JavaScript and D3
  • PostgreSQL and Amazon Redshift
  • Hive, Pig, Hortonworks and Cloudera
  • Python, Scala and Java

Others are surprising:

  • Tableau and SAS
  • SPSS and C#
  • Hive and Weka

It’s also interesting to note that Amazon EMR and Amazon Redshift usage fall into different clusters, and that EMR clusters separately from Cloudera and Hortonworks.

Since the authors changed clustering methods from 2014 to 2015, it’s difficult to identify movement in the respondent population.  One clear change is reflected in the separate cluster for R, which aligns more closely with the business user profile in the 2015 clustering.  In the 2014 clustering, R clustered together with Python and Weka.  This could easily be an artifact of the different clustering methods used — which the authors can rule out by clustering respondents to the 2014 survey using the 2015 methods.

Instead, the authors engage in silly speculation about R usage, citing tiny changes in tiny correlation coefficients.  (They don’t show the p-values for the correlations, but I suspect we can’t reject the hypothesis that they are all zero; so the change from year to year is also zero.)  Revolution Analytics’ acquisition by Microsoft has exactly zero impact on R users’ choice of operating system; and Teradata’s support for R in 2014 (which is limited to its Aster boxes) can’t have had a material impact on data scientists’ choice of tools.

It’s also telling that the most commonly used tools fall into a single cluster with the least commonly used tools.  Folks who dabble with survey segmentation are often surprised to find that there is one big segment that is kind of a catchall for features that do not differentiate respondents.   The way to deal with that is to remove the most and least cited responses from the list of active variables, since these do not differentiate respondents; spinning an interpretation of this “catchall” cluster is rubbish.
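
Here is a minimal sketch of that screening step, assuming a respondent-by-tool 0/1 usage matrix; the data and thresholds are illustrative, not the authors’ method.

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans

    # Hypothetical respondent-by-tool 0/1 usage matrix
    rng = np.random.RandomState(0)
    df = pd.DataFrame(rng.binomial(1, 0.4, size=(500, 20)),
                      columns=["tool_%d" % i for i in range(20)])

    # Drop tools that nearly everyone or nearly no one uses;
    # they do not differentiate respondents
    rates = df.mean()
    active = df.loc[:, (rates > 0.05) & (rates < 0.90)]

    clusters = KMeans(n_clusters=5, random_state=0).fit_predict(active)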


Benchmark: Spark Beats MapReduce


A group of scientists affiliated with IBM and several universities report on a detailed analysis of MapReduce and Spark performance across four different workloads.  In this benchmark, Spark outperformed MapReduce on Word Count, k-Means and Page Rank, while MapReduce outperformed Spark on Sort.

On the ADT Dev Watch blog, Dave Ramel summarizes the paper, arguing that it “brings into question … Databricks Daytona GraySort claim”.  This point refers to Databricks’ record-setting entry in the 2014 Sort Benchmark run by Chris Nyberg, Mehul Shah and Naga Govindaraju.

However, Ramel appears to have overlooked section 3.3.1 of the paper, where the researchers explicitly address this question:

This difference is mainly because our cluster is connected using 1 Gbps Ethernet, as compared to a 10 Gbps Ethernet in, i.e., in our cluster configuration network can become a bottleneck for Sort in Spark.

In other words, had they deployed Spark on a cluster with high-speed network connections, it likely would run the Sort faster than MapReduce did.

I guess we’ll know when Nyberg et al. release the 2015 GraySort results.

The IBM benchmark team found that k-means ran about 5X faster in Spark than in MapReduce.  Ramel highlights the difference between this and the Spark team’s claim that machine learning algorithms run “up to” 100X faster.

The actual performance comparison shown on the Spark website compares logistic regression, which the IBM researchers did not test.  One possible explanation — the Spark team may have tested against Mahout’s logistic regression algorithm, which runs on a single machine.  It’s hard to say, since the Spark team provides no backup documentation for its performance claims.  That needs to change.


IBM and Spark (Updated)


Updated March 8, 2016.  After publishing this post, I met with several IBM executives at Spark Summit East, who confirmed the accuracy of the original post and provided additional detail, which I’ve included in this version.  Updates are in bold red italics.

IBM also provided the low-resolution image.

IBM has a good story to tell — one in ten contributors to Spark 1.6 was an IBM employee.  But IBM does not tell its story effectively.  Nobody cares that IBM invented the punch card and the floppy disk.  Nobody cares that IBM is so big it can’t tell a straight product story.  Bigness is IBM’s problem.

On June 15, 2015, IBM announced a major commitment to Spark.  As we approach Spark Summit East, I thought it would be fun to check back and see how IBM’s accomplishments compare with the goals stated back in June.

Before we start, I’d like to note that any contribution to Spark moves the project forward, and is a good thing.  Also, simply by endorsing Spark, IBM has changed the conversation.  In early 2015, some analysts and journalists claimed that Spark was overhyped and “not enterprise ready.”  We haven’t heard a peep from this crowd since IBM’s announcement.  For that alone, IBM should get some kind of prize.  :-)

In its announcement, IBM detailed six initiatives:

  • IBM will build Spark into the core of the company’s analytics and commerce platforms.
  • IBM’s Watson Health Cloud will leverage Spark as a key underpinning for its insight platform, helping to deliver faster time to value for medical providers and researchers as they access new analytics around population health data.
  • IBM will open source its breakthrough IBM SystemML machine learning technology and collaborate with Databricks to advance Spark’s machine learning capabilities.
  • IBM will offer Spark as a Cloud service on IBM Bluemix to make it possible for app developers to quickly load data, model it, and derive the predictive artifact to use in their app.
  • IBM will commit more than 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide, and open a Spark Technology Center in San Francisco for the Data Science and Developer community to foster design-led innovation in intelligent applications.
  • IBM will educate more than 1 million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.

Let’s see where things stand.

Spark in IBM Analytics and Commerce Platforms

IBM has an expansive definition of “analytics”, reporting $17.9 billion in business analytics revenue in 2015.  IDC, which tracks the market, credits IBM with $4.5 billion in business analytics software revenue in 2014.  The remaining $13.4 billion, it seems, is services and fluff, neither of which counts when the discussion is “platforms.”

Of that $4.5 billion, the big dogs are DataStage (InfoSphere), DB2, Netezza (PureData System for Analytics), Cognos (IBM Business Analytics) and SPSS (IBM Predictive Analytics) — so this is where we should look when IBM says it is building Spark into its products.

Currently, Cloudant is the only IBM data source with a published Spark connector.  Want to access DB2 with Spark?  It’s a science project.  Of course, you can always use the JDBC connector if you’re patient, but the standard is a parallel high-speed connector, like SAS has offered for years.  An IBM insider tells me that there is a project underway to build a Spark connector for Netezza, which will be a good thing when it’s available.
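
For reference, the “patient” JDBC path looks like this in Spark 1.x.  A minimal sketch, assuming an existing sqlContext, the IBM DB2 JDBC driver jar on the Spark classpath, and placeholder connection details:

    # Generic single-threaded JDBC read from DB2 -- the slow path.
    # URL, table, and credentials are placeholders.
    df = sqlContext.read.format("jdbc").options(
        url="jdbc:db2://db2host:50000/SAMPLE",
        driver="com.ibm.db2.jcc.DB2Driver",
        dbtable="SALES",
        user="dbuser",
        password="secret").load()
    print(df.count())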

Update: IBM has subsequently added the one-way single-threaded Netezza connector to Spark Packages.  It’s also available on Git and the IBM Developer site.

I would emphasize that a one-way single-threaded connector is useful once, when you decommission your Netezza box and move the data elsewhere.  Netezza developed a native multi-threaded connector for SAS in a matter of months, so it’s not clear why it takes IBM so long to deliver something comparable for Spark.  


Last October in Budapest, an IBM VP — one of 137 IBM Veeps of “Big Data”  — claimed that DataStage InfoSphere supports Spark now.  Searching documentation for InfoSphere 11.3, the most current version, produces this:

[Screenshot: InfoSphere 11.3 documentation search results]

IBM appears to have discovered a new kind of product management where you build features into a product, then omit them from the documentation.

That was the approach taken for IBM Analytics Server Release 2.1, which was packaged up and shoved out the door so fast the documentation folks forgot to mention the Spark pushdown.  That said, Release 2.1.0.1 is an improvement; all functions that Analytic Server can push down to MapReduce now push down to Spark, and IBM supports the product on Cloudera and MapR as well as Hortonworks and BigInsights.

It’s not clear, though, why IBM thinks that licensing Analytics Server as a separate product is a smart move.  Most of the value in analytics is at the top of the stack (e.g., SPSS or Cognos), where users can see results.  Spark pushdown is rapidly becoming table stakes for analytics software; the smarter move for IBM is to bundle Analytics Server for free into SPSS, Cognos and BigInsights, to build value in those products.

It’s also curious that IBM simultaneously continues to peddle Analytics Server while donating SystemML to open source.  Why not push down to Spark through SystemML?

So far as Spark is concerned, IBM is leaving SPSS Statistics users out in the cold unless they want to add Modeler and Analytics Server to the stack.  For these customers, Alteryx and RapidMiner look attractive.

A search for Spark in Cognos documentation yields a big fat zero, which explains why Gartner just tossed IBM from the Leaders quadrant in BI.

Update: at my request, an IBM executive shared a list of products that IBM says it has rebuilt around Spark to date.  I’m publishing it verbatim for reference, but note that the list includes:

  1. Double and triple counting (“Watson Content Analytics integrates with Dataworks which integrates with Spark as a Service”).
  2. Products that do not seem to exist (Spark connectors to DB2 and Informix appear in IBM documentation as generic JDBC connections).
  3. Aspirational products (Spark on Z/OS).
  4. Projects that are not products (Watson Discovery Advisor ran a POC with Spark).
  5. Capabilities that require little or no contribution from IBM (Spark runs under Platform Symphony EGO YARN Service). 

Other than that, it’s a good list.

  • IBM BigInsights  ( Version 4.0 included Spark 1.3.1, version 4.1 includes Spark 1.4.1 – GA’ed August 25th
  • EHAAS (BigInsights on Cloud)  – Includes Spark version 1.3.1 – GA’ed June  2015
  • Analytic for Hadoop – Includes Spark version 1.3.1 – Beta . Will be replaced by Pay-go and include Spark 1.4.1
  • Spark-as-a Service  – Beta in July . GA – Oct. Currently uses Spark version 1.3.1. Will move to Spark 1.4.1
  • Dataworks (Only Cloud) – Beta  – Integrated with Spark Service. Uses Spark Version 1.3.1 
  • SPSS Analytic Server and SPSS Modeler – SPSS Modeler will support Spark version 1.4.1. GA planned for end of Q3, 2015
  • Cloudant  – Cloudant includes Spark Connector. This product is already GA.
  • Omni Channel Pricing – Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Dynamic Pricing –  Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Mark Down Optimization –   Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Nimbus ETL – Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Journey Analytics – Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA end of 2015
  • IBM Twitter CDE  On Cloud – Internal only
  • IBM Insights for Twitter Service – On Cloud , Externally Available
  • Internet of Things Real time Analytics on Cloud.  Integrating with Spark 1.3.1  – Open Beta
  • Platform Symphony EGO Service – Integrated with Spark Service for Resource Scheduling and Management. Can also be used with Spark bundled with IBM Open Platform.
  • DB2 Spark Connector
  • Netezza Spark Connector
  • Watson Content Analytics  -> Integrated with Dataworks which is integrating with Spark-as-a-Service
  • Watson Content Services – > Planning to use Spark for Data Ingestion and Enrichment. 
  • Spark on Zlinux  -> Spark enabled on zLinux
  • Spark on ZOS -> Will be available by end of year
  • No response
  • GPFS -GPFS – Spark, as part of BI, runs out of the box on GPFS. In our GPFS Ambari RPM package we changed the Spark service dependency from HDFS to GPFS and added the GPFS Hadoop connector jar to the classpath
  • Informix Connector
  • Watson Discovery Advisor  –  This is a product within Watson Health.  In a small POC that this team has done. they observed that using Scala and Spark , they can reduce their lines of code from 1000s to few hundred lines.” Looking to integrate Spark in early 2016.
  • Cognos – Team indicated that they will be able to submit SPARK SQL queries for getting the results for the data. They would connect to Spark using the JDBC/ODBC driver and then be able to execute Spark SQL queries to generate results for the report . This is planned for 2016.
  • Streams –   Spark MLLib Toolkit for InfoSphere Streams 
  • Watson Research team – Developed a Geospatial RDD on Spark.

Spark in IBM Watson Health Cloud

IBM insiders tell me that Watson relies heavily on Spark.  Okay.  Watson is a completely opaque product, so it’s impossible to verify whether IBM powers Watson with Spark or an army of trained crickets.

SystemML to Open Source

Though initially skeptical about SystemML, as I learn more about this software I’m more excited about its potential.  Rather than simply building an interface to the native Spark machine learning library (MLlib), I understand that IBM has completely rebuilt the algorithms.  That’s a good thing — some of the folks I know who have tried to use MLlib aren’t impressed.  Without getting into the details of the issues, suffice to say that it’s good for Spark to have multiple initiatives building functional libraries on top of the Spark core.

IBM’s Fred Reiss, chief architect at IBM’s Spark Technology Center, is scheduled to present about SystemML at Spark Summit East next week.

Spark in Bluemix

IBM introduced Spark-as-a-Service in Bluemix as a beta in July 2015 and moved it to general availability in October.

The service includes Spark, Jupyter Notebooks and SWIFT object storage.  It’s a bit lost in the jumble of services available in Bluemix, but the Catalogue has a handy search tool.

As of this writing, Bluemix offers Spark 1.4.  Although two dot releases behind, that is competitive with Qubole Data Service.  Databricks is still the best bet if you want the most up to date release.

Update: An IBM executive tells me that Bluemix now uses Spark 1.6.  In the meantime, however, IBM has removed the Spark release version from its Bluemix documentation.  

IBM People to Spark Projects

3,500 researchers and developers.  Wow!  That’s a lot of butts in seats.  Let’s break that down into four categories:

(1) IBM people who actively contribute to Spark.

(2) IBM developers building interfaces from IBM products to Spark.

(3) IBM developers building IBM products on top of Spark.

(4) IBM consultants building custom applications on top of Spark.

Note that of the four categories, only (1) actually moves the Spark project forward.  Of course, anyone who uses Spark has the potential to contribute feedback, but ultimately someone has to cut code.  While IBM tracks the Spark JIRAs to which it contributes, IBM executives could not answer a simple question: of the 248 people who contributed to Spark Release 1.6, how many work for IBM?

I suspect that most of those 3,500 researchers and developers are in categories (3) and (4).

Satheesh Bandaram from the IBM Spark Technology Center replies: 26 people from STC contributed to Spark 1.6, with about 80 code commits.

Additional IBM response: Since June 2015, when IBM announced the Spark Technology Center (STC), engineers in STC have actively contributed to Spark releases: v1.4.x, v1.5.x and v1.6.0, as well as releases v1.6.1 and v2.0 (in progress).

As of today (March 2), IBM STC has contributed to over 237 JIRAs and counting.  About 50% are answers to major JIRAs reported in Apache Spark.

What’s in those 237 contributions:

  • 103 of 237 (43%) are deliverables in the Spark SQL area
  • 56 (23%) are in the MLlib module
  • 37 (16%) are in the PySpark module

These top three areas of focus from IBM STC make up 82% of the total contributions to date.  The rest are in documentation, Spark Core, Streaming and other modules.

You can track progress on this live dashboard on GitHub: http://jiras.spark.tc/

Specific to Spark 1.6, IBM team members have over 80 commits, the majority of them from STC.  A total of 28 team members contributed to the release (25 of them from STC); each contributing engineer is credited in the Spark 1.6 release notes.

For Spark SQL, we contributed:

  • Enhancements and fixes in the new Dataset API, the DataFrame API and data types
  • UDF and SQL standard compliance, such as adding EXPLAIN and printSchema capability, and support for coalesce and repartition
  • Support for the CHAR column datatype
  • Fixes for type extractor failures on complex data types
  • Fixes for DataFrame bugs in saving long-column partitioned Parquet files, plus various nullability bugs and optimization issues
  • A fix for the limitation in the ORDER BY clause to comply with the standard
  • A number of UDF code fixes completing stddev support

For machine learning, the STC team met with key influencers and stakeholders in the Spark community to jointly work on items on the roadmap.  Most of the roadmap items discussed went into 1.6.  The implementation of the LU Decomposition algorithm is slated for the upcoming release.

In addition to helping implement the roadmap, here are some notable contributions:

  • Greatly improved PySpark distributed matrix algebra, enriching the matrix operations and fixing bugs
  • Enhanced the Word2Vec algorithm
  • Added optimized first- through fourth-order summary statistics for DataFrames (technically in Spark SQL, but related to machine learning)
  • Greatly enhanced the PySpark API by adding interfaces to Scala machine learning tools
  • Made a performance enhancement to the Linear Data Generator, which is critical for unit testing in Spark ML

The team also addressed major regressions on DataFrame API, enhanced support for Scala 2.11, made enhancements to the Spark History Server, and added JDBC Dialect for Apache Derby.

In addition to the JIRA activities, IBM STC also added JDBC dialect support for DB2 and made the Spark Connector for Netezza v0.1.1 available to the public through Spark Packages and a developer blog on IBM’s external site.

Spark Training

Like the Million Man March, “training a million people” sounds like one of those PR-driven claims that nobody expects to take seriously, especially since it’s not time-boxed.

Anyway, the details:

  • AMPLab offers occasional training in the complete BDAS stack under the AMPCamp format.  IBM funds AMPLab, but it does not appear that AMPLab is doing anything now that it wasn’t already doing last June.
  • DataCamp does not offer Spark training.
  • MetiStream offers public and private Spark training with a defined curriculum and service offering.  The training program is certified by Databricks.
  • Galvanize does not offer Spark training.
  • Big Data University offers a two-part MOOC in Spark fundamentals.

The Big Data University courses are free, and four hours apiece, so a million enrollees is plausible, eventually at least.  Interestingly, MetiStream developed the second of the two BDU courses.  So the press release should read “MetiStream and IBM, but mostly MetiStream, will train a million….”


The Year in SQL Engines


As an addendum to my year-end review of machine learning and deep learning, I offer this survey of SQL engines. SQL is the most widely used language for data science according to O’Reilly’s 2016 Data Science Salary Survey. Most projects require at least some SQL operations, and many need nothing but SQL.

This review covers six open source leaders: Hive, Impala, Spark SQL, Drill, HAWQ, and Presto; plus, for completeness, Calcite, Kylin, Phoenix, Tajo, and Trafodion. Omitted: two commercial options, Oracle Big Data SQL and IBM Big SQL, which IBM has not yet rebranded as “Watson SQL.”

(A reader asks: What about Druid? My response: erm. On inspection, I agree that Druid belongs in this category, so check it out.)

I use the term ‘SQL Engine’ loosely. Hive, for example, is not an engine; it’s a framework that uses the MapReduce, Tez, or Spark engines to run queries. And it doesn’t run standard SQL; it runs HiveQL, an SQL-like language. ‘SQL-in-Hadoop’ is also inapt; while Hive and Impala work primarily with Hadoop, Spark, Drill, HAWQ, and Presto also work with a wide variety of other data storage systems.

SQL engines operate independently of the data storage system; relational databases, in contrast, bundle the query engine and storage into a single tightly coupled system, which permits certain types of optimization. Uncoupling the two provides greater flexibility, though at a potential cost in performance.

Figure 1, below, shows the relative popularity of the leading SQL engines according to DB-Engines, a website maintained by the Austrian consultancy Solid IT. DB-Engines computes a monthly popularity score for more than 200 database systems. The score reflects search engine queries, mentions in online discussions, job offers, mentions in professional profiles, and tweets.

Figure 1


Source: DB-Engines, January 2017 http://db-engines.com/en/ranking

Although Impala, Spark SQL, Drill, HAWQ, and Presto consistently beat Hive on measures such as runtime performance, concurrency, and throughput, Hive remains the most popular (at least by the DB-Engines metric). There are three reasons why that is so:

— Hive is the default option for SQL in Hadoop, supported in every distribution. The others align with specific vendors and cater to niche users.

— Hive has closed the performance gap to the other engines. Most of the Hive alternatives launched in 2012, when analysts would rather kill themselves than wait for a Hive query to finish. But while Impala, Spark, Drill, et al. ran away like rabbits back then, Hive just kept chugging along, tortoise-like, with incremental improvements. Today, while Hive is not the fastest choice, it’s a lot better than it was five years ago.

— While bleeding-edge speed is cool, most organizations know that the world does not end if a junior marketing manager has to wait ten seconds to find out if the chicken wings outperformed the buffalo burgers in the Duxbury restaurant last Tuesday.

As you can see in Figure 2, below, the top SQL engines compete well for user interest compared to leading commercial data warehouse appliances.

Figure 2


Source: DB-Engines, January 2017 http://db-engines.com/en/ranking

The best measure of health for an open source project is the size of its active developer community. Hive and Presto have the largest base of contributors, as shown in Figure 3, below. (Data for Spark SQL is unavailable.)

Figure 3


Source: Open Hub https://www.openhub.net/

In 2016, Cloudera, Hortonworks, Kognitio, and Teradata waded into the Battle of the Benchmarks, which Tony Baer summarizes. I’m sure that you will be shocked to learn that the vendor’s preferred SQL engine outperformed the others in each of these studies, which raises the question: are benchmarks bullshit?

AtScale‘s biannual benchmark is not BS. AtScale, a BI startup, markets software that brokers between BI front ends and SQL backends. The company’s software is engine-neutral — it seeks to run on as many as possible — and its broad experience in BI gives the testing a real-world flavor.

AtScale’s key findings from its most recent round, which included Hive, Impala, Spark SQL, and Presto:

— All four engines successfully ran AtScale’s BI benchmark queries.

— Each engine has its own performance “sweet spot” depending on data volume, query complexity, and concurrent users.

– Impala and Spark SQL outperform the others in queries against small data sets

– On large data sets, Impala and Spark SQL handle complex joins better than the others

– Impala and Presto demonstrate the best results in concurrency tests

— All engines showed 2X-4X performance gains in the six months since AtScale’s previous benchmark.

Alex Woodie reports on the test results; Andrew Oliver analyzes.

Let’s dive into the individual projects.

Apache Hive

Apache Hive was the first SQL framework in the Hadoop ecosystem. Engineers at Facebook introduced Hive in 2007 and donated the code to the Apache Software Foundation in 2008; in September 2010, Hive graduated to top-level Apache project status. Every major player in the Hadoop ecosystem distributes and supports Hive, including Cloudera, MapR, Hortonworks, and IBM. Amazon Web Services offers a modified version of Hive as a cloud service in Elastic MapReduce (EMR).

Early releases of Hive used MapReduce to run queries. Complex queries required multiple passes through the data, which impaired performance. As a result, Hive was not suitable for interactive analysis. Led by Hortonworks, the Stinger initiative markedly enhanced Hive’s performance, notably through the use of Apache Tez, an application framework that delivers streamlined MapReduce code. Tez and ORCfile, a new storage format, produced a significant speedup for Hive queries.

Cloudera Labs spearheaded a parallel project to re-engineer Hive’s back end to run on Apache Spark. After an extended beta, Cloudera released Hive-on-Spark to general availability in early 2016.

More than 100 individuals contributed to Hive in 2016. The team announced Hive 2.0 in February and Hive 2.1 in June. Hive 2.0 includes several improvements to Hive-on-Spark, plus performance, usability, supportability, and stability enhancements. Hive 2.1 includes Hive LLAP (“Live Long and Process”), which combines persistent query servers and optimized in-memory caching for high performance. The team claims a 25X speedup.

In September, the Hivemall project entered the Apache Incubator, as I noted in Part Two of my machine learning year-end roundup. Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run in Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team plans an initial release in Q1 2017.

Apache Impala

Cloudera launched Impala, an open source MPP SQL engine, in 2012 as a high-performance alternative to Hive. Impala works with HDFS and HBase, and it leverages Hive metadata; however, it bypasses MapReduce to run queries.

Mike Olson, Cloudera’s Chief Strategy Officer, argued in late 2013 that Hive’s architecture was fundamentally flawed. In Olson’s view, developers could only deliver high-performance SQL with a whole new approach, exemplified by Impala. In 2014, Cloudera released a series of benchmarks in January, May, and September. In these tests, Impala showed progressive improvement in query runtime, and significantly outperformed Hive on Tez, Spark SQL, and Presto. In addition to running fast, Impala performed particularly well in concurrency, throughput, and scalability.

In 2015, Cloudera donated Impala to the Apache Software Foundation, where it entered the Apache Incubator program. Cloudera, MapR, Oracle and Amazon Web Services distribute Impala;  Cloudera, MapR, and Oracle provide commercial build and installation support.

Impala made steady progress in the Apache Incubator in 2016. The team cleaned up the code, ported it to Apache infrastructure, and delivered Release 2.7.0, its first Apache release, in October. The new version includes performance and scalability improvements, as well as some other minor enhancements.

In September, Cloudera published results of a study that compared Impala to Amazon Web Services’ Redshift columnar database. The report is interesting reading, though subject to the usual caveats about vendor benchmarks.

Spark SQL

Spark SQL is a Spark component for structured data processing. The Apache Spark team launched Spark SQL in 2014 and absorbed Shark, an early Hive-on-Spark project. It quickly became the most widely used Spark module.

Spark SQL users can run SQL queries, read data from Hive, or use it as a means to create Spark Datasets and DataFrames. (Datasets are distributed collections of data; DataFrames are Datasets organized into named columns.) The Spark SQL interface provides Spark with information about the structure of the data and the operations to be performed; Spark’s Catalyst optimizer uses this information to construct an efficient query.
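
A minimal PySpark sketch of that dual interface, mixing the DataFrame API and plain SQL over the same data (Spark 2.x; the names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demo").getOrCreate()
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 29)], ["name", "age"])

    df.createOrReplaceTempView("people")  # register the DataFrame for SQL
    spark.sql("SELECT name FROM people WHERE age > 30").show()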

In 2015, Spark’s machine learning developers introduced the ML API, a package that leveraged Spark DataFrames instead of the lower-level Spark RDD API. This approach proved to be attractive and fruitful; in 2016, with Release 2.0, the Spark team placed the RDD-based API in maintenance mode. The DataFrames API is now the primary interface for Spark machine learning.

Also in 2016, the team released Structured Streaming, still an alpha as of Spark 2.1.0. Structured Streaming is a stream processing engine built on Spark SQL. Users can query streaming data sources in the same manner as static sources, and they can combine streaming and static sources in a single query. Spark SQL runs the query continuously and updates results as streaming data arrives. Structured Streaming delivers exactly-once fault-tolerance guarantees through checkpointing and write-ahead logs.
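
A minimal Structured Streaming sketch, essentially the canonical streaming word count: it queries a socket source exactly as if it were a static table, and Spark updates the counts as data arrives. The host and port are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("stream-demo").getOrCreate()

    # Treat a socket source as an unbounded table (placeholder host/port)
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Continuously update the full result set on the console
    query = (counts.writeStream.outputMode("complete")
             .format("console").start())
    query.awaitTermination()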

Apache Drill

In 2012, a group led by MapR, one of the leading Hadoop distributors, proposed to build an open-source version of Google’s Dremel, a distributed system for interactive ad-hoc analysis. They named the project Apache Drill. Drill languished in the Apache Incubator for more than two years, finally graduating in late 2014. The team delivered its 1.0 release in 2015.

MapR distributes and supports Apache Drill.

More than 50 individuals contributed to Drill in 2016. The team delivered five dot releases in 2016. Key enhancements include:

  • Web authentication
  • Support for the Apache Kudu columnar database
  • Support for HBase 1.x
  • Dynamic UDF support

Two key Drill contributors left MapR to start Dremio in 2015; the startup remains in stealth mode.

Apache HAWQ

Pivotal Software introduced HAWQ as a commercially licensed high-performance SQL engine in 2012 and attempted to market it with minimal success. Changing strategy, Pivotal donated the project to Apache in June 2015, and it entered the Apache Incubator program in September 2015.

Fifteen months later, HAWQ remains in the Incubator. The team released HAWQ 2.0.0.0 in December, with a load of bug fixes. I suspect the project will graduate in 2017.

One small point in HAWQ’s favor is its support for Apache MADlib, the machine-learning-in-SQL project that is also still in the Incubator. The combination of HAWQ and MADlib should be a nice consolation to the folks who bought Greenplum and wonder what the hell happened.

Presto

Facebook engineers initiated the Presto project in 2012 as a fast interactive alternative to Hive. Rolled out in 2013, the software successfully supported more than a thousand Facebook users and more than 30,000 queries per day on petabytes of data. Facebook released Presto to open source in 2013.

Presto supports ANSI SQL queries across a range of data sources, including Hive, Cassandra, relational databases or proprietary file systems (such as Amazon Web Services’ S3.)  Presto queries can federate data from multiple sources.  Users can submit queries from C, Java, Node.js, PHP, Python, R and Ruby.
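
From Python, for example, a query submitted through the PyHive client (one of several available Presto client libraries) looks like this; the host, port, and table are placeholders.

    from pyhive import presto  # one of several Presto client libraries

    cursor = presto.connect(host="presto-coordinator", port=8080).cursor()
    cursor.execute("SELECT order_id, total FROM hive.default.orders LIMIT 10")
    for row in cursor.fetchall():
        print(row)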

Airpal, a web-based query tool developed by Airbnb, offers users the ability to submit queries to Presto through a browser. Qubole provides a managed service for Presto. AWS delivers a Presto service on EMR.

In June 2015, Teradata announced plans to develop and support the project.  Under an announced three-phase program, Teradata proposed to integrate Presto into the Hadoop ecosystem, enable operation under YARN and enhance connectivity through ODBC and JDBC. Teradata offers its own distribution of Presto, complete with a data sheet. In June, Teradata announced the certification of Information Builders, Looker, Qlik, Tableau, and ZoomData, with MicroStrategy and Microsoft Power BI on the way.

Presto is a very active project, with a vast and vibrant contributor community. The team cranks out releases faster than Miki Sudo eats hot dogs — I count 42 releases in 2016. Teradata hasn’t bothered to summarize what’s new, and I don’t plan to sift through 42 sets of release notes, so let’s just say it’s better.

Other Apache Projects

There are five other SQL-ish projects in the Apache ecosystem.

Apache Calcite

Apache Calcite is an open source framework for building databases. It includes:

— A SQL parser, validator and JDBC driver

— Query optimization tools, including a relational algebra API, rule-based planner, and a cost-based query optimizer.

Apache Hive uses Calcite for cost-based query optimization, while Apache Drill and Apache Kylin use the SQL parser.

The Calcite team pushed out five releases in 2016, with bug fixes and new adapters for Cassandra, Druid, and Elasticsearch.

Apache Kylin

Apache Kylin is an OLAP engine with a SQL interface. Developed by eBay and donated to Apache, Kylin graduated to top-level status in 2015.

A startup named Kyligence launched in 2016; it offers commercial support and a data warehousing product called KAP, FWIW. While the company has no funding listed in Crunchbase, a source tells me that it has strong backing and a large office in Shanghai.

Apache Phoenix

Apache Phoenix is a SQL framework that runs on HBase and bypasses MapReduce. Salesforce developed the software and donated it to Apache in 2013. The project graduated to top-level status in May 2014. Hortonworks includes Phoenix in the Hortonworks Data Platform. Since the leading SQL engines all work with HBase, it’s not clear why we need Phoenix.

Apache Tajo

Apache Tajo is a fast SQL data warehousing framework introduced in 2011 by Gruter, a Big Data infrastructure company, and donated to Apache in 2013. Tajo graduated to top level status in 2014. The project has attracted little interest from prospective users and contributors outside of Gruter’s primary market in South Korea. Other than a brief mention by Gartner’s Nick Heudecker, the project isn’t on anyone’s dashboard.

Apache Trafodion

Apache Trafodion is another SQL-on-HBase project, conceived by HP Labs, which tells you pretty much all you need to know. HP launched Trafodion in June 2014, a month after Apache Phoenix graduated to production. Six months later, it dawned on HP executives that there might be limited commercial potential for another SQL-on-HBase engine — I can see the facepalms — so they donated the project to Apache, where it entered the Incubator in May 2015.

Trafodion promises to be a transactional database if it ever gets out of incubation. Unfortunately, there are lots of options in that space, and the only competitive benefit the development team can articulate seems to be “it’s open source, so it’s cheap.”


Data Science and Machine Learning Predictions


This is the time of year when everyone looks to the year ahead. Here are four things in data science and machine learning that are utterly and completely predictable in 2018.

Data Science Matures

In the Pleistocene Era of Data Science, there were Heroes and Hackers: lone souls working on ad hoc projects with Pig, Hive, Mahout, Java, and a few prayers. For asset management, organizations used thumb drives and email. Collaboration was a non-issue because there were few others, if any, to collaborate with.

In time, organizations hired more data scientists. Heroes and Hackers evolved into Data Science Guerrillas armed with laptops and notebooks. IT didn’t want anything to do with data science; it’s messy and complicated, so it was easier to simply pretend it didn’t exist. Responsible team leaders asked contributors to store assets on Git; some complied, some didn’t, but it hardly mattered because the Git library was a disorganized mess. For tooling, data scientists used a quodlibet of languages, packages, and notebooks, which made cross-checking and peer review problematic. Nobody could agree on a common set of tools, so collaboration was rare.

“Guerrilla Data Science”

Today, data science has matured to the point that organizations expect a return on their investment. They want to see faster turnaround, and more value. Nobody cares if you won Kaggle; we want to see a minimum viable data product while we’re young.

Smart organizations adopt a collaborative model of data science. The collaborative model recognizes that the data scientist is one member of a larger team that may include business analysts, data engineers, developers, machine learning engineers, DevOps specialists, compliance specialists, security professionals, and many others all pulling together to deliver a working application.

“Collaborative Data Science”

The rise of collaborative data science leads organizations to adopt open data science platforms that do the following:

  • Provide a shared platform for all data science contributors
  • Facilitate the use of open data science tools (such as Python and R) at scale
  • Provide self-service access to data, storage, and compute
  • Support a complete pipeline from data to deployment
  • Include collaborative development tools
  • Ensure asset management and reproducibility

There are now multiple offerings in the market from vendors including Amazon Web Services, Anaconda, Cloudera, DataScience, Domino, Google, IBM, and Microsoft. In 2017, venture capitalists funded several startups in the category, which suggests that there is strong growth potential.

In 2018, look for more organizations to adopt a collaborative model of data science, and invest in an open data science platform.

Automated Machine Learning Gets Real

Forget the hype. No, automated machine learning does not mean you can fire your data scientists. Automated machine learning makes your data scientists more productive.

Several months ago, a data scientist explained to me why it takes him weeks to build a predictive model. “I have to run a hundred experiments to find the best model,” he complained, as he showed me his Jupyter notebooks. “That takes time. Every experiment takes a lot of programming, because there are so many different parameters. We cross-check everything manually to make sure there are no mistakes.”

After listening to this for an hour, I was ready to kill myself.

Automated machine learning does not eliminate the hard parts of a data scientist’s job, such as listening to clients, understanding the business problem, and figuring out how to craft a solution. It automates the stupid parts of the job. Like repetitive programming. The kind of stuff researchers delegate to interns and new hires.

Think of it like this. We’ve had robotic heart surgery for 20 years, but you don’t see cardiac surgeons standing by freeway exits holding signs that say Will Work for Food. If I have a heart problem, I’m not calling Watson — I’m going to see Dr. Angina down at University Hospital.

“The surgical robot will see you now.”

It’s the same with data science. When the CEO needs answers to really important questions, she’s not calling Watson. She’s calling the CAO or the Chief Data Scientist or whatever. Someone with skin in the game. Because when real executives delegate a task, they delegate it to someone they trust.

Organizations that want to invest in automated machine learning have plenty of commercial and open source options. Amazon Web Services, DataRobot, Google, H2O.ai, IBM, and SAS all offer automated learners; some of these are much better than others (but I’d rather hold a detailed discussion of the differences for a later post). In the open source ecosystem, we have auto-sklearn, Auto Tune Models, Auto-Weka, machine-JS, and TPOT.
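
To make that concrete, here is a minimal sketch of automated pipeline search with TPOT, one of the open source learners listed above. The dataset and search budget are illustrative, not a benchmark; the point is that one call replaces the hundred hand-coded experiments my friend was grinding through.

```python
# A minimal sketch of automated machine learning with TPOT.
# Dataset and search budget are illustrative, not a benchmark.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# TPOT searches over preprocessing steps, model families, and
# hyperparameters: the experiments a data scientist would otherwise
# code and cross-check by hand.
tpot = TPOTClassifier(generations=5, population_size=20,
                      verbosity=2, random_state=42)
tpot.fit(X_train, y_train)

print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # the winning pipeline, as plain Python
```

The exported pipeline is ordinary scikit-learn code, so a human still reviews, tests, and deploys the result. The machine does the drudgery; the data scientist keeps the judgment calls.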

Prediction: in 2018 we’re going to see many more offerings, and more organizations will adopt the tools.

Data Scientists Discover GDPR Applies to Data Scientists

On May 25, 2018, the European Union’s General Data Protection Regulation (GDPR) takes effect. The reaction in the data science community will be something like this:

  • February: nothing
  • March: nothing
  • April: WTF is GDPR?
  • May: hair on fire

Data scientists discover that GDPR applies to them.

As I’ve written elsewhere, much of the commentary about GDPR misstates the likely impact on data science. There’s a lot of talk about the “right to an explanation,” which is actually a “right to human-in-the-loop decision-making.” But this provision applies to a narrow set of transactions, and affects front-office customer interactions more than data scientists.

GDPR’s greatest impact on data science practice is the obligation it imposes to avoid bias in predictive models used in decisions about consumers. In practice, this means that data science teams must survive an audit of their methods and procedures. Reproducibility and data lineage will be de rigueur.
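
What does surviving an audit look like in practice? Here is a minimal sketch, and only a sketch: record enough metadata alongside every trained model that an auditor can trace what was built, from what data, with which code. Every identifier, path, and field below is hypothetical.

```python
# A sketch of model lineage capture; all names and fields are hypothetical.
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone

def sha256_bytes(data: bytes) -> str:
    """Hash the exact training inputs so they can be verified later."""
    return hashlib.sha256(data).hexdigest()

# Stand-in for the raw training file you would hash in real life.
training_data = b"applicant_id,income,default\n1,52000,0\n2,31000,1\n"

lineage = {
    "model_id": "credit_risk_v12",   # hypothetical identifier
    "trained_at": datetime.now(timezone.utc).isoformat(),
    "training_data_sha256": sha256_bytes(training_data),
    "code_version": "git:4f2a9c1",   # commit hash, captured at train time
    "python": sys.version,
    "platform": platform.platform(),
}

# One JSON document per model version: cheap to write, priceless in an audit.
with open("credit_risk_v12.lineage.json", "w") as f:
    json.dump(lineage, f, indent=2)
```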

That’s one more reason to put Heroes, Hackers, and Guerrillas behind you, and adopt a mature model of data science.

While GDPR sets out general principles, it leaves many details to the European Data Protection Board (EDPB). This secretariat will issue detailed guidance for controllers and processors – for example, on the data portability right, Data Protection Impact Assessments, certifications, and the role of Data Protection Officers. Like any regulator, EDPB will issue guidance over time, and the rules may be complex. Thus, compliance won’t be a matter of learning a few principles once; it will be an ongoing effort to understand requirements as they evolve.

GDPR Compliance Officer (*)

Meet the new boss, your GDPR Compliance Officer. She’s up on all the latest rulings, as well as legal requirements imposed by the separate states in which your organization operates. She’s going to engage in all of your data science projects, and she’ll tell you what you need to do to comply with the regulations.  You’re going to do whatever she tells you to do, or your work will never see the light of day.

(*) Yeah, I know — it’s Natalia Poklonskaya. No hidden political message there, I just like the picture.

Cloud, Blah, Blah, Blah, Blah…

Cloud is neither a great platform for data science nor a good platform. It’s the only logical platform.

Think of it like this. It makes sense for organizations to invest in IT infrastructure for workloads that are persistent, predictable, and mission-critical. Everything else should go to the cloud.

If you live in Manhattan and want to visit Grandma in Shrewsbury twice a year, you don’t buy a Tesla unless you’re filthy rich. You rent a Zipcar, or take an Uber.

Are data science workloads persistent, predictable and/or mission-critical? If you answered “none of the above” go to the head of the class. Data science projects are time-boxed and short-term. They require brief massive bursts of computing power. And they are rarely mission-critical.
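
That burst pattern is precisely what on-demand cloud instances are for. Here is a sketch of the rent-and-return workflow using boto3, assuming an AWS account; the AMI ID and instance type are placeholders.

```python
# A sketch of burst compute on AWS: rent a big box, train, terminate.
# The AMI ID is a placeholder; credentials and region come from your config.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder deep learning AMI
    InstanceType="p3.8xlarge",        # four GPUs, billed by the hour
    MinCount=1,
    MaxCount=1,
)
instance = instances[0]
instance.wait_until_running()
print("training box is up:", instance.id)

# ... run the training job, copy the results to S3 ...

instance.terminate()  # the meter stops when the box does
```

No capital expenditure, no idle hardware: you pay for the burst and nothing else.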

I’m tempted to “predict” that data science will move to the cloud in 2018. Except that data science moved to the cloud a long time ago. I don’t have statistics, but here are some anecdotes:

  • 2010: Razorfish, the digital marketing agency, pulls the plug on its servers and moves everything to AWS
  • 2014: Data scientists at a leading US bank say they’ve moved 100% of model development to the cloud
  • 2015: A leading strategy consultancy uses a Virtual Private Cloud for 100% of its data science workloads

Analytic service providers and consultants led the way into the cloud. As variable-cost organizations, they had a huge incentive to stop investing in IT infrastructure. And, they had the skills to use the cloud back when it was hard.

It’s getting easier to use the cloud, so economic logic prevails.

Yes, there are some holdouts: organizations that prohibit use of cloud, or take a go-slow approach. But they are increasingly rare.

Predicting that data science will continue moving to the cloud is like predicting that the Mississippi River will continue flowing into the Gulf of Mexico.

IBM: Four More Quarters of Decline. Oh, Wait…

I was going to predict four more quarters of declining revenue for IBM. But then the company threw a monkey wrench into the works and reported increased sales in Q4. So, let’s offer a round of golf applause for the folks at Armonk.

But remember: the U.S.S. Arizona stopped sinking when it settled into the mud.

Does this mean IBM’s big investment in Watson is finally paying off? Well, no. Take a squint at the numbers. The big jump in revenue comes from the Systems business, where IBM reports a big jump in…wait for it…System Z boxes, aka mainframes. And, in the Cognitive Solutions segment, IBM says that security and transaction processing software drove the revenue increase. You know, stuff like CICS that runs on mainframes.

So, the handful of organizations that account for most of IBM’s revenue decided that it’s easier to upgrade some of their old boxes than it is to replace them wholesale with modern architecture.

Not that there’s anything wrong with that.

Why, you ask, does IBM include software for mainframe transaction processing in its “Cognitive Solutions” business unit? Good question. One theory: when IBM reorganized, the most important consideration was to make sure that each of CEO Ginni Rometty’s one-downs had a big enough fief to justify super-sized compensation. IBM had to throw the kitchen sink into “Cognitive Solutions” to make it a suitable prince-bishopric.

Which explains why “Cognitive Solutions” has a 4% growth rate. 4% isn’t a growth story. It’s a “we’re just keeping our heads above water” story. It’s tough to grow when your business is sandbagged with the dogs IBM has collected over the years. Yes, Virginia, there is still a Red Brick Warehouse.

Each quarter, IBM breathlessly announces “wins” for Watson. Scan through the 10-Qs, however, and you know what you don’t see? The words “Watson” and “revenue” in close conjunction. That’s because auditors actually care about such things as “revenue recognition” and “materiality” and keeping BS out of the financial statements. Lest they get sued. Wall Street pleads with IBM to show some results from Watson. So you figure that if IBM actually had material Watson revenue, you’d see it in the financials.

IBM reports revenue of ~$20 billion annually for the “Cognitive Solutions” business. But industry analyst IDC estimates IBM’s actual revenue from cognitive and AI software at about $160 million. Which means that the IBM cognitive story is one part reality and 124 parts window dressing.

Keep that in mind the next time an IBM executive wants to talk to you about the power of cognitive computing.

No prediction. I just enjoy snarking at IBM.

Predicting the 2019 MQ

The die is cast. Last month, Gartner selected 16 vendors to include in its 2019 Magic Quadrant for Data Science and Machine Learning. Now, as Gartner prepares to publish the report early next year, I think it will be fun to make some predictions about how each vendor will fare.

Some ground rules. I’m not going to talk about DataRobot, my employer. Nor will I discuss the accuracy or value of Gartner’s assessments. People can agree or disagree with Gartner, but many people trust their analysis. Vendors invest a significant amount of time participating in the MQ; at a minimum, they believe there is a segment of customers who trust Gartner completely.

Several factors make this task hard:

  • Gartner has information that is not available to the public. Vendors brief Gartner about their vision under nondisclosure. A firm that plans a major shift in strategy might disclose it to Gartner but not to the public. Gartner also surveys customers.
  • Gartner’s evaluation criteria are not static. Each year, Gartner adds to the list of product features and functions it uses to assess current offerings. Participating vendors can influence Gartner’s view of the future. Thus, a vendor that submits the same materials from one year to the next could see a marked decline in its rating if Gartner believes that the market as a whole has moved forward.

In making these predictions, I’m going to focus on the areas Gartner says are challenges for each vendor, and see how well the vendor has addressed these challenges since the last report.

Many of my predictions will be wrong, for the reasons cited above. But the only way to avoid prediction error is to avoid predictions.

The Gartner MQ

First, let’s talk about how Gartner evaluates vendors.

You can secure a copy of the Gartner report here if you are a client, or read it for free here, courtesy of SAS. The image below shows how Gartner positioned the vendors.

Magic Quadrant for Data Science and Machine-Learning Platforms

The horizontal dimension, Completeness of Vision (CoV), depends on Gartner’s assessment of a vendor’s market understanding and product strategy. In practice, a high score on this dimension means that the vendor’s view of the world aligns with Gartner’s view of the world. Vendors who work actively with Gartner’s analysts and act on Gartner recommendations tend to do well on CoV.

Vendors who pivot strategically and make dramatic changes (such as acquisitions or major product launches) can increase the CoV score markedly. For example, when Microsoft launched Azure Machine Learning Studio and acquired Revolution Analytics three years ago, it moved dramatically to the right on the MQ, from Niche Player to Visionary.

The vertical dimension is Ability to Execute (AtE). Product features affect this dimension, but a vendor’s product score is just one of several factors. Others include a vendor’s viability (which is relatively easy to assess) and the results of the customer survey (which are not). Vendors with strong customer experience scores from the survey tend to keep them from year to year (and vice-versa), but there are exceptions to that rule of thumb. Dataiku, for example, dropped precipitously on AtE from 2017 to 2018, primarily due to concerns surfaced in the customer survey.

Since a vendor’s product score accounts for only part of the overall AtE score, minor product enhancements do not produce dramatic movement from year to year. The most predictable improvements in AtE stem from major new products introduced relatively late in Gartner’s MQ cycle. When this happens, Gartner factors the product into a vendor’s CoV, but the product does not impact AtE until a minimum number of customers use it, typically in the following year. Gartner discloses these situations, so a careful reading of the MQ report provides information we can use to predict a vendor’s rating in the following year.

For example, H2O.ai released Driverless AI to Alpha last July. Since H2O.ai had no production customers for the product, its features did not impact H2O.ai’s product and AtE scores, but Gartner factored the release into H2O.ai’s CoV rating. This year, the product is generally available and can affect H2O.ai’s product score and AtE (if it has a sufficient number of customers).

Vendor Assessments and Predictions

2018 Leaders

Alteryx

According to Gartner, Alteryx’s strong upward movement on Ability to Execute (AtE) in 2018 reflects its revenue growth, customer acquisition, and the IPO. The strategy behind the Yhat acquisition positively affected its Completeness of Vision (CoV). For 2019, the Yhat acquisition is already baked into Alteryx’s CoV score, and Alteryx hasn’t materially addressed the deficiencies Gartner cited in machine learning and enterprise readiness.

Prediction: Alteryx will remain a leader, and will maintain its position on AtE, but could decline slightly on CoV.

H2O.ai

In 2018, H2O markedly improved its position on both dimensions. The improved CoV reflects the company’s development of Driverless AI and work with GPU acceleration. Since these innovations are already reflected in Gartner’s assessment, it’s not likely that H2O will improve on CoV in 2019.

H2O’s November funding round and partnership with NVIDIA may drive an improved AtE rating. Driverless AI is a distinct product; it may or may not affect H2O’s dot position, as it is unlikely that there are enough reference customers for the product to meet Gartner’s inclusion criteria.

Prediction: H2O will improve its rating on AtE based on General Availability of Driverless AI and continued progress with GPU acceleration.

KNIME

KNIME improved markedly on CoV in 2018, retaining its position in the Leader quadrant.

Prediction: KNIME will hold its position on AtE based on solid customer satisfaction, but the company hasn’t materially addressed the challenges Gartner cited in the 2018 report and could decline on CoV.

RapidMiner

In 2018, RapidMiner declined somewhat on AtE. Product improvements in Releases 8 and 9 are interesting, but they don’t address any of the concerns Gartner surfaced in the MQ. The company struggles to raise new funding, which may impair Gartner’s assessment of its viability.

Prediction: RapidMiner will continue to decline on AtE, but not by enough to fall out of the Leader quadrant.

SAS

For 2018, Gartner writes that “SAS remains a Leader, but has lost some ground in terms of both Completeness of Vision and Ability to Execute. The Visual Analytics suite shows promise because of its Viya cloud-ready architecture, which is more open than prior SAS architecture and makes analytics more accessible to a broad range of users. However, a confusing multiproduct approach has worsened SAS’s Completeness of Vision, and a perception of high licensing costs has impaired its Ability to Execute. As the market’s focus shifts to open-source software and flexibility, SAS’s slowness to offer a cohesive, open platform has taken its toll.”

Prediction: SAS seems to be singing from a new hymnal, which may be enough to boost its rating on CoV. SAS’s multiproduct approach and pricing model are baked into its business model and will not change. However, strong customer satisfaction offsets these concerns. SAS remains a Leader.

2018 Visionaries

Databricks

Databricks announced support for Azure in November 2017, too late for Gartner to consider it in the 2018 MQ. The company has announced no other major moves or enhancements.

Prediction: Databricks will improve on AtE (reflecting the Azure expansion).

Dataiku

Dataiku declined markedly on both dimensions in 2018. Releases 4.3 and 5.x of Data Science Studio address some concerns surfaced by Gartner in the 2018 report. Release 4.3 addresses issues in model deployment. Release 5.x delivers containerized workloads and support for deep learning.  

Prediction: Dataiku will improve on AtE.

Domino Data Lab

In 2018, Gartner highlighted Domino’s weakness in data access, preparation, exploration, and visualization. These are inherent in Domino’s architecture and approach, and unlikely to change.

Prediction: Domino will hold its position.

IBM

IBM fell out of the Leaders quadrant in 2018, declining on both dimensions. Gartner raised concerns about IBM’s complex and confusing mix of brands, poor customer experience, and slow improvements to SPSS. IBM is structurally unable to address these issues. The proliferation of brands reflects the many different fiefdoms within IBM, who seem to be unable to coordinate or agree on a unified strategy. Customer experience issues stem from IBM’s ongoing headcount reduction and outsourcing of customer service to low-cost venues. And IBM is simply unwilling to make significant investments in SPSS, preferring instead to develop new products under the Watson brand.

In 2018, Gartner evaluated SPSS only; Data Science Experience did not qualify for inclusion. IBM has subsequently rebranded DSX as IBM Watson Studio and rolled some SPSS features into it. This new product may or may not meet Gartner’s requirements for inclusion. Forrester likes Watson Studio a lot; personally, I think it lacks coherence. In any case, IBM Watson Studio is available in IBM Cloud only, which Gartner will see as a serious limitation.

Prediction: IBM will continue to decline on CoV. IBM Watson Studio is a modest improvement over last year’s product but is unlikely to have a sufficient number of customers to influence IBM’s AtE rating. IBM remains a Visionary.

Microsoft

Microsoft’s position in the MQ has not changed significantly in three years. Gartner attributes its score on CoV to “low scores for market responsiveness and product viability, as Azure Machine Learning Studio’s cloud-only nature limits its usability for the many advanced analytic use cases that require an on-premises option.”

Gartner has criticized Microsoft for at least two consecutive MQs for lacking an on-premises capability. This is puzzling because Microsoft clearly has such a capability in Microsoft Machine Learning Server, the latest version of the Revolution Analytics code base acquired in 2015. Gartner implies that it would have considered this product, but that Microsoft did not release the latest version until after the cutoff date for the MQ.

Prediction: Microsoft will articulate a unified vision for Machine Learning Server and Azure Machine Learning, and submit both products for evaluation, which will improve the company’s score on both dimensions. Microsoft’s recent acquisition of Bonsai will improve its CoV score.

2018 Challengers

Mathworks

Gartner notes that Mathworks’ CoV “is limited by its focus on engineering and high-end financial use cases, largely to the exclusion of customer-facing use cases like marketing, sales and customer service.” This is not likely to change.

Prediction: Mathworks will hold its position.

TIBCO

TIBCO entered this MQ when it acquired Statistica in 2017; subsequently, TIBCO acquired Alpine Data. For 2018, Gartner notes that “in terms of Ability to Execute, this Magic Quadrant evaluates only TIBCO’s ability with the Statistica platform. Other acquisitions by TIBCO contribute only to its Completeness of Vision.” Since TIBCO’s CoV score is lower than the score Statistica achieved on its own, the acquisitions evidently contributed nothing to its CoV.

TIBCO has rebranded Alpine as Spotfire Data Science, but actual integration of the software is “still in its early stages,” as Gartner puts it. Alpine scored near the bottom on AtE in 2016, the last time it appeared on its own. Under Dell ownership, Statistica made it into the Leader quadrant in the 2016 MQ but fell back into the Challenger quadrant for 2017 and 2018.

Prediction: Until TIBCO can integrate the Alpine code base with Statistica and Spotfire, the acquisition has no impact on TIBCO’s AtE score. If TIBCO can articulate a coherent vision for the acquired products, its CoV will improve.

2018 Niche Players

Anaconda

Anaconda holds a weak position in the Niche quadrant and has made no significant moves since last year.

Prediction: Anaconda remains a Niche player.

Angoss

Datawatch acquired Angoss in January 2018. Datawatch owns Monarch Panopticon, which appeared in the 2016 MQ for BI but not since then.

Prediction: Datawatch is a bottom feeder, and is unlikely to participate in the MQ.

SAP

SAP entered the MQ when it acquired KXEN in 2013. Since then, it has steadily declined on AtE, which is somewhat surprising for an organization with SAP’s deep pockets. Gartner notes that SAP Leonardo was not a factor in the company’s AtE rating and that the Leonardo vision seems decoupled from the existing product. In any case, Leonardo also was not a factor in SAP’s CoV rating, which increased only slightly.

SAP’s vision, customer experience, and product development issues appear to be structural, and unlikely to change.

Prediction: SAP will continue to decline in AtE. The Leonardo rollout will not improve SAP’s CoV position for this MQ.

Teradata

Gartner writes that Teradata’s “lack of cohesion and ease of use on the data science development side have impaired both its Ability to Execute and its progress on the Completeness of Vision axis. It remains a Niche Player.”  Teradata recently retired Aster, a core component of its data science offering.

Teradata executives talked a lot about machine learning at its recent event. Teradata talking about machine learning is like Mel Gibson talking about his boobs. He doesn’t have any, and nobody would care if he did.

Prediction: Teradata will be kicked out of the MQ entirely.

New Entrants

There are likely to be two spots open in the MQ for new entrants. In 2018, Gartner cited the following vendors for Honorable Mention:

  • Amazon
  • Big Squid
  • DataRobot
  • DataScience.com
  • FICO
  • Google
  • Megaputer
  • Pitney Bowes

Of these, Amazon has the best chance to join the MQ.

  • Amazon actively participates in other Gartner MQs, with some success. SageMaker and AWS AI/ML services are a powerful combination. AWS has a good track record of articulating a vision that aligns with Gartner’s view of the world. If AWS wants to jump in, it could qualify almost as well as Databricks.
  • Big Squid is not ready for the big leagues.
  • DataScience.com had a decent product but struggled commercially. The acquisition by Oracle makes the company viable, but Oracle’s rebranded version of the product will not be available until June 2019.
  • FICO lacks the broad industry appeal required for this MQ.
  • Google Cloud DataLab is Google’s equivalent to SageMaker, and not nearly as good. Google’s automated machine learning is still in Alpha release, and it supports image recognition only. Google doesn’t seem to care much about competing in Gartner MQs.
  • Megaputer dropped out of the MQ in 2017 and isn’t likely to drop back in.
  • Pitney Bowes is just mailing it in at this point.

Cloudera is another potential new entrant. Cloudera did not meet Gartner’s inclusion criteria for the 2018 MQ but might qualify for the 2019 MQ. Cloudera Data Science Workbench is not as good as Domino Data Lab and Databricks, but better than Anaconda. If Cloudera enters, it will be a Niche Player.


2018 in AI/ML

Well, 2018 is dead and gone. Time to take a look back at the year in AI/ML.

A reminder that I work for DataRobot. This is my personal blog. Opinions are mine.

On the Move

It’s hard to believe that Amazon Web Services introduced Amazon SageMaker just a year ago, but here we are. AWS moved aggressively to enhance the service with new native and partner capabilities. The service still targets the experienced AWS developer, but AWS can move upmarket if it chooses. Joyent CTO Bryan Cantrill tweets:

Am waiting for the year that reInvent goes full Red Wedding, locking the doors and announcing that every attendee’s product or service is now a forthcoming AWS offering. Or maybe that was this year?

Reinforcement learning is among the new bits in SageMaker. If you want to push your toddler into an AI career, AWS announced DeepRacer, an autonomous model race car. Just in time for Christmas.

AWS also announced Amazon Forecast at re:Invent. Amazon Forecast is a managed service for time series analysis. As a global retailer, Amazon has a huge forecasting problem and skills to match. Marketing materials stress AWS’ deep experience in retail forecasting, a credential that would appeal to retailers if retailers wanted to do business with AWS. They don’t, however, so AWS might want to shut up about its retailing chops.

Dataiku did not do well in Gartner’s 2018 MQ, dropping like a stone to the bottom of the Ability to Execute axis. Not quite the bottom: Dataiku did better than Teradata. “We beat Teradata” gives cold comfort when you note that Teradata deprecated Aster and exited the category.

To its credit, Dataiku responded to the issues that Gartner surfaced, adding a deployment API and containerized engines. Customer ratings in Gartner PeerInsights look strong, which bodes well for Dataiku’s position in the 2019 MQ.

Update: Dataiku announces a $101 million Series C round of venture capital. Iconiq Capital leads the round, with Alven Capital, Battery Ventures, Dawn Capital and FirstMark Capital also participating. Alven, Battery, and FirstMark all participated in Dataiku’s B round in 2017.

Databricks wants to be more than “the people who invented Apache Spark.” That’s a self-limiting value proposition: to the extent that Spark matures and stabilizes, customers are less likely to need Matei Zaharia to tune their RDDs.

In its first few years, Databricks was little more than Spark on AWS, and its main emphasis was on data engineering. Starting last year, the company pivoted towards the AI and machine learning market. This meant supporting a wider range of Java, Python, R, and Scala packages as well as Spark ML and Spark packages. Databricks also broadened its support for deep learning frameworks, including TensorFlow, MXNet, Keras, PyTorch, Caffe and Microsoft Cognitive Toolkit.

The pivot pays dividends. Earlier this year, Databricks scored a “Visionary” rating in Gartner’s 2018 MQ, thanks to its innovation and scalability. In June, Databricks introduced MLflow for tool integration, experiment tracking, reproducibility, and model deployment. That’s a positive step towards an integrated data science and machine learning platform.
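
For the curious, here is a minimal sketch of MLflow’s tracking API in use; the model and metric are illustrative.

```python
# A minimal sketch of MLflow experiment tracking on a toy model.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    alpha = 0.5
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))

    # Parameters, metrics, and the model itself land in the tracking
    # store, so runs can be compared and reproduced later.
    mlflow.log_param("alpha", alpha)
    mlflow.log_metric("mse", mse)
    mlflow.sklearn.log_model(model, "model")
```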

DataRobot (my employer) had a good year. The company released time series forecasting, model management, and model monitoring, among other things. In October, DataRobot closed a $100 million Series D round led by Sapphire Ventures and Meritech Capital Partners. Privately held startups don’t release financials, so the only way to tell which startups are going somewhere is to follow the money. Venture capitalists don’t back losers.

H2O.ai opened the champagne when Gartner released its 2018 MQ. Getting named to the Leader quadrant is a big deal, and it takes a lot of work to get there. Driverless AI is now available in all three cloud marketplaces. You can also install it on-premises on an NVIDIA GPU-accelerated box or an IBM “Minsky” server.

H2O.ai shipped 32 versions of Driverless AI this year, a fact that can be interpreted in more than one way. Agile sprints sound cool at hacker meetups, but enterprise customers don’t like to upgrade software every eleven days. Major product enhancements include GLM, time series, an alpha release of TensorFlow support, and a text analytics capability that depends on TensorFlow, so I guess that’s in alpha too.

H2O World attracted “record audiences” when it played New York and London. It was the first time the show played those venues, so an audience greater than zero breaks the record.

Muddling Through

You have to give Alteryx credit: they know how to bang the cash register and collect money. Alteryx sellers delivered 50% revenue growth and a thousand new logos each quarter this year. That’s a great track record.

However, the average starting sale is small, and account expansion is minuscule. Sooner or later you run out of new logos to land.

In AI/ML, Alteryx did little with the Yhat assets it acquired last year. This is not too surprising. When your flagship product runs only on Windows and you acquire software that runs only on Linux, it can take time to consummate the marriage.

Speaking of marriage, it’s an open secret that Alteryx plans to partner with H2O.ai. Alteryx CEO Dean Stoecker will keynote H2O World SFO in February. We’ll see how that works out.

Google just muddling through? WTF? Google deprecated its Prediction API earlier this year. The AutoML product line will support vision, speech, and translation if it ever gets out of beta. And Cloud DataLab isn’t winning any prizes. So right now Google doesn’t have a horse in the predictive analytics race.

Nobody cares about TPUs unless you can do something with them.

IBM shoved IBM Watson Studio out the door in time to make the key analyst reports. Forrester awarded IBM a top rating in the revamped “Wave” for MultiModal Predictive Analytics, which is nice. Gartner’s MQ isn’t available until February, so we’ll just have to wait and see what Gartner thinks. Gartner gives more weight to customer feedback, which has been a dumpster fire for IBM the last few MQs. It’s the main reason Gartner kicked IBM out of the “Leaders” quadrant earlier this year. Oh, the shame of it.

In my view, IBM Watson Studio exemplifies what IBM does best: taking old things and calling them new things. The service seems like a quodlibet of existing services, with an additional splash of a new SPSS that looks strikingly like the old SPSS. It’s available in IBM Cloud, everyone’s last choice in cloud platforms.

Incidentally, IBM peddles the line that they are the biggest contributor to Spark’s machine learning library. They might want to reconsider using that factoid. In the past two years, I can’t think of a single new feature added to Spark ML. And data scientists surveyed by KDnuggets in 2018 were less likely to use Spark than those polled in 2017.

Mathworks seems to be doing a better job at analyst relations. I don’t have much to say about Mathworks. Like SAS, it’s the object of desire for a cult of users who will give it up when you wrench it from their cold dead hands.

Microsoft shuffled the executive chairs in AI/ML leadership. This could mean something or it could mean nothing. Microsoft made a splash a couple of years ago when it introduced Azure Machine Learning Studio and acquired Revolution Analytics. Since then, all seems quiet in Redmond. The company has dozens of products and services for machine learning and AI, all of which seem to be managed in silos. There’s no coherent product strategy that bridges cloud and on-premises computing, which is surprising given the market strength of Azure.

RapidMiner launched new enhancements for data prep, feature engineering, and automated model training. Even so, and despite endorsements from Forrester and Gartner, the company hasn’t landed new venture capital in more than two years. It’s unclear why this is so. Some sources attribute it to the history of litigation with TIBCO. Others think that RapidMiner’s adoring users of its free software aren’t willing to pay for what they use. You know, the classic “students and hoboes” problem that makes it hard to monetize open source software.

Perhaps TIBCO will buy RapidMiner. Speaking of which, TIBCO seems to be doing better at analyst relations. Forrester awarded them an above-average rating in strategy, FWIW.

SAS published a lot of blog posts this year. The company has a blog farm that spews content like Mount Vesuvius in full eruption. Product-wise, however, all seems quiet in Cary. The June Viya release was underwhelming.

SAS revenue in 2017 was $3.24 billion. If it exceeds $3.40 billion in 2018, I’ll buy Dave Mac lunch at Yum Yum Sushi Thai on Harrison Oaks.

Not Really Muddling

Ayasdi killed off its Federal business and lost its CEO. Headcount declined from 130 to 109 over the year. Ayasdi’s last funding round was three and a half years ago. As a rule, startups that can’t or won’t raise a new round aren’t thriving.

Cloudera announced plans to offer Cloudera Data Science Workbench in the cloud. I doubt that AWS is shaking in its boots about that. CDSW’s main advantage is its integration with Cloudera. Put the same product in the cloud and you have a notebook-based platform for Python and R that doesn’t support Jupyter. I could make a joke about that, but out of respect for my former colleagues at Cloudera, I won’t.

Oracle acquired DataScience.com in June, and on the strength of that secured a “Leader” rating in the Forrester Wave for notebook-based data science platforms. Then everyone died or something because Oracle Data Science Cloud won’t be available until June 2019.

Notice how most of the vendors in the Forrester “Wave” have little circles, but Oracle has just a little dot?

Now flip to the table on page 8 of the report. See that 0.00 under Customer adoption for Oracle? It means that nobody uses the Oracle product.

Oracle Data Science Cloud. It’s the leading product that nobody uses.

Regular readers of this blog may recall that we celebrate the passing of AI/ML vendors with bye-ku. Here’s one for DataScience.com.

A bright future for

Data Science dot com? No…

No, no, no, no, no.

Gartner rated KXEN a Challenger the year before SAP bought the company in 2013. Since then, SAP just keeps sinking deeper into Niche Vendor territory. (The niche, it seems, is “SAP bigots.”) According to Gartner:

SAP has one of the lowest overall customer satisfaction scores in this Magic Quadrant. Its reference customers indicated that their overall experience with SAP was poor, and that the ability of its products to meet their needs was low. SAP continues to struggle to gain mind share for PA across its traditional customer base. SAP is one of the most infrequently considered vendors, relative to other vendors in the Magic Quadrant, by those choosing a data science and machine-learning platform.

In other words, existing customers think SAP sucks, and everyone else stays well clear.

The product itself seems little changed since 2013 or, for that matter, since first launched by KXEN in 1999. Meanwhile, SAP plays this fun game called “Where’s Leonardo?” where they tease everyone with all the wonderful capabilities that will be built into SAP Leonardo while hiding the actual product. Quoting Gartner again:

SAP Leonardo Machine Learning and other components of the SAP Leonardo ecosystem did not contribute to SAP’s Ability to Execute position in this Magic Quadrant.

Translation: we can’t evaluate slideware.

Teradata deprecated Aster, finally, and no longer has a product in the data science and machine learning category. Curiously, at Teradata events, top management still talks about machine learning. Here’s a bye-ku for Teradata’s machine learning business:

Teradata says

It has machine learning but

Sadly it does not.

Goodbye, Teradata. You have enough work to do getting those database customers to stick.

Predictions for 2019

It’s that time of year again. Time to drive stakes in the ground about the year ahead.

Looking Back

First, a brief look back to see if last year’s predictions aged well. You can read them here. Recap below.

My predictions come with a lifetime guarantee: if you are not completely satisfied with them, I will return to you, in cash, the amount you paid to read them.

Data Science Matures. It’s hard to measure data science maturity, which makes this an excellent topic for predictions. A sign that the Wild West days of data science are over: growing interest in collaborative data science platforms. Platforms like Cloudera Data Science Workbench, Dataiku Data Science Studio, and Domino Data Science Platform are doing well in the marketplace. There is enough commercial interest that Forrester now evaluates these tools separately from tools that target “citizen” data scientists.

Automated Machine Learning Gets Real. You might think that last year I was plugging DataRobot, my current employer. It’s the other way around. I wrote the prediction before joining DataRobot.

Here is a list of vendors in the data science and machine learning space who claim to offer automation:

  • Amazon Web Services
  • Ayasdi
  • BigML
  • Dataiku
  • Databricks
  • DataRobot
  • Google Cloud Platform
  • H2O.ai
  • IBM
  • KNIME
  • Microsoft Azure
  • RapidMiner
  • Salesforce
  • SAP
  • SAS
  • SparkBeyond
  • SparkCognition
  • TIBCO
  • Willis Towers Watson

That’s not a complete list by any means. As of this writing, there are 3,808 companies listed in Crunchbase in the Machine Learning category. Many of these claim to automate the process.

Some of those claims are BS. But the proliferating message tells you two things about automation. First, people who say that you cannot automate machine learning are losing the argument. Second, the idea of automation resonates with customers.

Data Scientists Discover GDPR Applies to Data Scientists. Much of the current interest in interpretability stems from a perception that GDPR mandates explainable models. It actually doesn’t do that — there is much nonsense written about GDPR — but it’s fair to say that a lot of hair caught fire on May 25.

Much of the nonsense about GDPR stems from the fact that GDPR itself sets out broad aims, not compliance detail. GDPR designates the European Data Protection Board (EDPB) as the compliance authority. EDPB is still getting up and running: looking for office space in Brussels, shopping for furniture, checking out the Moules Marinières at Chez Leon, figuring out how to give directions to the loo in 24 languages.

So far, how many binding decisions have EDPB issued? None. Aucun. Keiner. Nessuna. Ninguna. Nenhum. Geen. Nici unul. Żaden. Žádný. Jih ni. Nijedan. Bat ere ez. Ingen. Puudub. Κανένας. Níl aon cheann. Nė vienas. Нито един. Nav. Nikto. Egyik sem. Ei mitään.

In other words, you may think you know what GDPR means for data science practice, but you don’t. So STFU.

Cloud, Blah, Blah, Blah, Blah…. Yeah. That was a layup.

IBM: Four More Quarters of Decline. Oh, Wait… After a brief flirtation with revenue growth in Q4 2017, IBM returned to its declining ways in Q1 2018. I’m sure that the 18% increase in Q4 2017 sales immediately followed by a 15% decline in Q1 2018 sales was purely happenstance and not the result of IBM sellers gaming 2017 sales.

Note that IBM’s hardware business is its strong point. In Q3 2018, revenue for “Cognitive Solutions,” including all of the Watson stuff, declined 6%. Technology Services and Cloud Platforms declined 2%.

Looking Ahead

So, what’s coming in 2019? Here are three predictions. (A reminder that I work for DataRobot. My bias is obvious. Opinions are mine, not DataRobot’s.)

AI strategy moves to the front of the queue. What’s the most significant constraint holding up investment in AI? Hiring skilled people, right? Wrong. According to McKinsey, the biggest obstacle to AI adoption is the lack of a clear strategy.

Big whoop, you think. A company that sells strategy says you need more strategy. My local fish merchant says I should eat more herring.

But McKinsey is right. Many organizations simply do not know what they want to do with AI.

Moreover, the absence of strategy complicates hiring. Do you need more data engineers or do you need folks who understand the ins and outs of recurrent neural networks? You can’t possibly know how to staff up if you haven’t decided what you’re going to do.

Chief AI Officers Emerge. Who owns AI in the organization?

Some folks think that the Chief Data Officer should own AI. That’s a terrible idea.

Data is like office furniture. You need data for AI. You also need furniture, so your data scientists have a place to sit. But nobody talks about putting the office manager in charge of AI.

Put an executive in charge of X, and you get more X. CDOs want your business to hoover up every speck of data and store it in cavernous data warehouses or vast data lakes that nobody uses. They measure success in petabytes.

Hey, look at all these petabytes.

My data lake has more petabytes than your data lake.

As if you need more petabytes. You don’t use 80% of the petabytes you already have.

Put a CDO in charge of AI, and she’ll spend her time trying to prove that AI justifies her previous investment in a vast data lake. Hey, look. Our new AI says people buy beer and diapers together. Isn’t that special? We would not know that if we didn’t have this data lake.

By the way, the “beer and diapers” story isn’t apocryphal. In 1992, a Teradata team analyzed 1.2 million market baskets from 25 drug stores. They discovered that between 5:00 p.m. and 7:00 p.m. customers purchased beer and diapers together. Teradata spent the next ten years touting that discovery at trade shows.

The most important part of the story is the part Teradata didn’t talk about: the customer never did anything with that “insight.” The customer, it seems, didn’t give two fucks what people put in their baskets as long as they paid.

In every large database, there are millions of patterns. Of these, people care about a select few.
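
If you want to see the millions-of-patterns problem for yourself, run an association-rule miner over any basket data. A sketch with mlxtend, using four toy baskets:

```python
# A sketch of association-rule mining; the baskets are toy data.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

baskets = pd.DataFrame(
    [[1, 1, 0, 1], [1, 1, 1, 0], [0, 1, 1, 1], [1, 0, 1, 1]],
    columns=["beer", "diapers", "milk", "chips"],
).astype(bool)

frequent = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1.0)

# Even four toy baskets yield a pile of rules; a real retail database
# yields orders of magnitude more, most of which nobody cares about.
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```

The algorithm finds every pattern above the thresholds. It cannot tell you which ones are worth acting on. That part is still a business judgment, which is exactly the point.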

You need a Chief AI Officer tasked with building AI that people actually use. Put an executive in charge of AI adoption, and you will get more AI adoption. This is a good thing.

Democratization gets real. Remember way back in 1994, when ISL introduced Clementine? Clementine was the first machine learning package to offer a drag-and-drop icon-based user interface. (SPSS acquired ISL, and eventually rebranded the software as SPSS Modeler. IBM acquired SPSS, and rebranded it as Watson Machine Learning.)

SPSS Modeler and SAS Enterprise Miner were supposed to democratize data science. Same for Alpine, Alteryx, BigML, Dataiku, IBM Watson Analytics, IBM Watson Studio, KNIME, Microsoft Azure Machine Learning Studio, Predixion Software, RapidMiner, and Statistica. With so many drag-and-drop tools on the market for so many years, data science must be entirely democratized by now.

Oh, wait. It isn’t.

Analysts who keep predicting the triumph of citizen data scientists ought to pause and ponder why not.

First, people have to want to do data science. It might surprise you to learn that not everyone cares about machine learning. A friend and former colleague is a senior executive at one of the larger consumer banks in North America. She plays a crucial role in risk management and credit policy. She doesn’t tell her friends that the bank needs more LSTM neural nets for text analytics, because they’re so much better than bag-of-words.

Second, making something easier doesn’t necessarily end the need for specialists. We’ve had practical robotic surgery for fifteen years. That doesn’t mean I’m going to rent a surgical robot at Home Depot and remove my own gall bladder if I get acute cholecystitis. I’ll go and see Dr. Cutter at Mass General. If she uses a robot, that’s her business.

Third, people who are serious about putting AI to work in your organization don’t care whether data science tools are “easy to use.” They will walk across beds of hot coals to get the data they need and eat nails to tune a model. “Code-free” isn’t a benefit for people who learned how to code ten years ago.

AI democratization isn’t all about tools. It’s about your organization:

  • Redefining roles
  • Rethinking the workflow
  • Finding better ways to identify and prioritize AI opportunities

Democratization does not mean that the receptionist will build models between phone calls. It means that the receptionist contributes to an AI project that will improve customer call handling. People contribute to projects in many different ways. They don’t all need to know how to hyperparameterize a Generalized Linear Model.

Vendors that understand this will thrive. They will help customers define strategy, define roles, rethink the workflow, and identify AI opportunities.

Vendors that think democratization is all about software won’t thrive. They will continue to believe that if they can add one more icon to the drag-and-drop palette, data science will be totally democratized. They will wonder why customers don’t flock to their tools.

That is all.




