Articles on this Page
- 01/02/14--05:45: Analytic Startups: 0xdata (Updated May 2014)
- 01/02/14--13:15: Apache Spark for Bi...
- 01/03/14--05:56: Analytic Startups: Skytree
- 02/05/14--11:28: Machine Learning in Hadoop: Part One
- 02/12/14--09:50: Machine Learning in Hadoop: Part Two
- 04/09/14--08:11: Automated Predictive Modeling
- 04/22/14--08:34: Analytic User Personas
- 04/23/14--08:26: Python for Analytics
- 05/01/14--09:41: Distributed Analytics: A Primer
- 05/19/14--10:08: How to Optimize Your Marketing Spend
- 11/04/14--05:32: SAS in Hadoop: An Update
- 12/01/14--06:00: SAS Versus R (Part 1)
- 12/15/14--06:00: SAS Versus R Part Two
- 02/17/15--05:00: Software for High Performance Advanced Analytics
- 04/30/15--06:08: How to Buy SAS Visual Analytics
- 06/12/15--14:12: Spark 1.4 Released
- 10/05/15--10:04: O’Reilly Data Science Survey 2015
- 10/07/15--09:24: Benchmark: Spark Beats MapReduce
- 02/13/16--13:47: IBM and Spark (Updated)
- 01/31/17--21:32: The Year in SQL Engines
Updated May 22, 2014. 0xdata (“Hexa-data”) is a small group of smart people from Stanford and Silicon Valley with VC backing and an open source software project for advanced analytics (H2O). Founded in 2011, 0xdata first appeared on analyst dashboards in 2012 and has steadily built a presence in the data science community since then. […]
Updated and bumped July 10, 2014. For a PowerPoint version on SlideShare, go here. Introduction: Apache Spark is an open source distributed computing framework for advanced analytics in Hadoop. Originally developed as a research project at UC Berkeley’s AMPLab, the project achieved incubator status in Apache in June 2013 and top-level status in February 2014. According to one analyst, Apache […]
Skytree started out as an academic machine learning project developed at Georgia Tech’s Fastlab. Leadership shopped the software to a number of software vendors prior to 2011 and, finding no buyers, launched as a standalone venture in 2012. In April 2013, Skytree announced Series A funding of $18 million, with backing from U.S. Venture Partners, […]
Much has changed since I last blogged on this subject a year ago (here and here). This is the first of a three-part blog covering the current state of play for machine learning in Hadoop. I use the term “machine learning” deliberately, to refer to tools that can learn from data in an automated or […]
This is the second of a three-part series on the current state of play for machine learning in Hadoop. Part One is here. In this post, we cover open source options. As we noted in Part One, machine learning is one of several technologies for analytics; the broader category also includes fast queries, streaming analytics […]
A colleague asks: can we automate predictive modeling? How we answer the question depends on the context. Consider the two variations on the question below, with more precise wording: (1) Can we completely eliminate the need for expertise in predictive modeling, so that an “ordinary business user” can do it? (2) Can we make expert […]
Analytic users are not all the same; in most organizations, there are a number of different user “personalities”, or personas, with distinct needs. If you develop an analytics architecture for your organization or develop analytic software to sell to others, it is important to understand these personas. In this essay, I profile four personas: Power […]
[Figures: user personas; Google Trends Data Scientist. Author: thomaswdinsmore]
A reader complains that I did not include Python in a survey of Machine Learning in Hadoop. It’s a fair point. There was a lively debate last year between R and Python advocates, variously described as a war or a boxing match. Matt Asay argued that Python is displacing R; Sharon Machlis and David Smith countered. In […]
[Figures: python; Strata Tool Correlation]
Can we leverage distributed computing for machine learning and predictive analytics? The question keeps surfacing in different contexts, so I thought I’d take a few minutes to write an overview of the topic. The question is important for four reasons: (1) Source data for analytics frequently resides in distributed data platforms, such as MPP appliances or […]
There are formal methods and tools you can use to optimize marketing spend, including software from SAS, IBM and HP (among others). The usefulness of these methods, however, depends on basic disciplines that are missing from many Marketing organizations. In this post I’d like to propose some informal rules for marketing optimization. These do not exclude using […]
[Figure: Digital Marketing Program Spend 2009 – 2016]
SAS supports several different products that run “inside” Hadoop based on two different in-memory architectures: (1) The SAS High Performance Analytics suite, originally designed to run in dedicated Teradata and Greenplum appliances, includes five modules: Statistics, Data Mining, Text Mining, Econometrics and Optimization. (2) A second set of products — SAS Visual Analytics, SAS Visual Statistics and […]
Which is better for analytics, SAS or R? One frequently sees discussions on this topic in social media; for examples, see here, here, here, here, here and here. Like many debates in social media, the degree of conviction is often inverse to the quantity of information, and these discussions often produce more heat than light. […]
In a previous post, I summarized some myths about SAS and R — arguments offered by proponents of one or the other that deserve to be dismissed. In this post, I will review some arguments that do make sense — things to consider if you are an aspiring analyst or if you are an executive […]
Strata+Hadoop World week is a good opportunity to update the list of platforms for high-performance advanced analytics. Vendors are hustling this week to announce their latest enhancements; I’ll post updates as needed. First, some definitions. The scope of this analysis includes software with the following properties: (1) Support for supervised and unsupervised machine learning; (2) Support for distributed […]
Stories about SAS Visual Analytics are among the most widely read posts on this blog. In the last two years I’ve received many queries from readers who complain that it’s hard to get clear answers about the software from SAS. In software procurement, the customer has bargaining power until the deal closes; after that, power […]
[Figures: SASVA vs Tableau; HP4VA]
On June 11, the Spark team announced availability of Release 1.4. More than 210 contributors from 70 different organizations contributed more than 1,000 patches. Spark continues to expand its contributor base, the best measure of health for an open source project. Spark Core: The Spark team continues to improve Spark operability, performance and compatibility. Key enhancements […]
O’Reilly releases its 2015 Data Science Salary Survey. The report, authored by John King and Roger Magoulas, summarizes results from an ongoing web survey. The 2015 survey includes responses from “over 600” participants, down from the “over 800” tabulated in 2014. The authors note that the survey includes self-selected respondents from the O’Reilly audience and […]
A group of scientists affiliated with IBM and several universities report on a detailed analysis of MapReduce and Spark performance across four different workloads. In this benchmark, Spark outperformed MapReduce on Word Count, k-Means and PageRank, while MapReduce outperformed Spark on Sort. On the ADT Dev Watch blog, Dave Ramel summarizes the paper, arguing that it “brings into question ... Databricks Daytona GraySort […]
Updated March 8, 2016. After publishing this post, I met with several IBM executives at Spark Summit East, who confirmed the accuracy of the original post and provided additional detail, which I’ve included in this version. Updates are in bold red italics. IBM also provided the low-resolution image. IBM has a good story to tell — […]
As an addendum to my year-end review of machine learning and deep learning, I offer this survey of SQL engines. SQL is the most widely used language for data science according to O’Reilly’s 2016 Data Science Salary Survey. Most projects require at least some SQL operations, and many need nothing but SQL. This review covers six open source leaders: Hive, Impala,