• eBay

    Discussion of eBay’s use of database and analytic technology. Related subjects include:

    March 17, 2015

    More notes on HBase

    1. Continuing from last week’s HBase post, the Cloudera folks were fairly proud of HBase’s features for performance and scalability. Indeed, they suggested that use cases which were a good technical match for HBase were those that required fast random reads and writes with high concurrency and strict consistency. Some of the HBase architecture for query performance seems to be:

    Notwithstanding that a couple of those features sound like they might help with analytic queries, the base expectation is that you’ll periodically massage your HBase data into a more analytically-oriented form. For example — I was talking with Cloudera after all — you could put it into Parquet.

    2. The discussion of which kinds of data are originally put into HBase was a bit confusing.

    OpenTSDB, by the way, likes to store detailed data and aggregates side-by-side, which resembles a pattern I discussed in my recent BI for NoSQL post.

    3. HBase supports caching, tiered storage, and so on. Cloudera is pretty sure that it is publicly known (I presume from blog posts or conference talks) that: … Read more

    September 21, 2014

    Data as an asset

    We all tend to assume that data is a great and glorious asset. How solid is this assumption?

    *“Our assets are our people, capital and reputation. If any of these is ever diminished, the last is the most difficult to restore.” I love that motto, even if Goldman Sachs itself eventually stopped living up to it. If nothing else, my own business depends primarily on my reputation and information.

    This all raises the idea: if you think data is so valuable, maybe you should get more of it. Areas in which enterprises have made significant and/or successful investments in data acquisition include: … Read more

    March 12, 2012

    Kinds of data integration and movement

    “Data integration” can mean many different things, to an extent that’s impeding me from writing about the area. So I’ll start by simply laying out some of the myriad ways that data can be brought to where it is needed, and worry about other subjects later. Yes, this is a massive wall of text, and incomplete even so — but that in itself is my central point.

    There are two main paradigms for data integration:

    Data movement and replication typically take one of three forms:

    Beyond the core functions of movement, replication, and/or federation, there are other concerns closely connected to data integration. These include:

    In particular, the following are largely different from each other. Read more

    November 16, 2011

    QlikView 11 and the rise of collaborative BI

    QlikView 11 came out last month. Let me start by pointing out:

    *One confusing aspect to that paper: non-standard uses of the terms “analytic app” and “document”.

    As QlikTech tells it, QlikView 11 adds two kinds of collaboration features:

    I’d add a third kind, because QlikView 11 also takes some baby steps toward what I regard as a key aspect of BI collaboration — the ability to define and track your own metrics. It’s way, way short of what I called for in metric flexibility in a post last year, but at least it’s a small start.

    Read more

    October 19, 2011

    What those nested data structures are about

    As I’ve noted before, the very big web companies have an issue with nested data structures. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a good tactic.

    The explanation was led by Oliver Ratzesberger, late of eBay* and progenitor of eBay’s Singularity project. In simplest terms, one event can spawn a lot of event attribute information, perhaps in the form of name-value pairs, which it then makes sense to store together in some way. The example Oliver dwelled on was that, on any given web page, there can be 100+ pieces of information to record, including:

    *Edit: Oliver subsequently moved on to Sears and then Teradata.

    There are several reasons why one might wish to store this information in ways that grieve relational purists. First, reconstructing all this information via joins would be brutally expensive. What’s more, reconstructing it via joins could be impractical. Some of it comes from third-party ad servers, which might not reproduce the same ads upon demand. Some is in the form of rankings, which can’t always be reliably reproduced from one query to the next. (That’s just one of several reasons text search and relational DBMS are an awkward fit.)

    Also, there’s a strong dynamic schema flavor to these databases. The list of attributes for one web click might be very different in kind from the list for the next page. Forcing that kind of variability into a fixed relational schema, while theoretically possible, doesn’t necessarily make a lot of sense.
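    A minimal sketch of the idea, in Python, may help. Everything in it is an assumption made for illustration: the event shapes, the field names (`page`, `attributes`, `ads_shown` and so on), and the helper functions are hypothetical, not eBay's actual format. It keeps each event's attributes together as one nested record, then shows what the same data looks like when flattened into name-value rows, and how little the attribute lists of two different events need to overlap.

```python
import json

# Hypothetical page-view events; all field names are illustrative only.
events = [
    {
        "event_id": 1,
        "page": "search_results",
        "attributes": {
            "query": "vintage camera",
            "ranked_item_ids": [901, 455, 230],  # ranking as shown at click time
            "ads_shown": [{"slot": "top", "ad_id": "a-17"}],  # from an ad server
        },
    },
    {
        "event_id": 2,
        "page": "item_detail",
        "attributes": {"item_id": 455, "seller_rating": 4.8},
    },
]


def to_name_value_rows(event):
    """Flatten one event into (event_id, name, value) rows.

    Nested values are serialized as JSON text, since a single flat
    column cannot hold a list or a record directly.
    """
    return [
        (event["event_id"], name, json.dumps(value))
        for name, value in sorted(event["attributes"].items())
    ]


def attribute_names(event):
    """Return the set of attribute names recorded for one event."""
    return set(event["attributes"])


# Flattened form: one row per attribute, to be re-joined on event_id later.
rows = [row for event in events for row in to_name_value_rows(event)]

# Attribute names the two events have in common (none, in this example).
shared = attribute_names(events[0]) & attribute_names(events[1])

if __name__ == "__main__":
    for row in rows:
        print(row)
    print("shared attribute names:", shared)
```

    In this toy case the two events share no attribute names at all, which is the dynamic-schema point in miniature. Storing the nested record as written also preserves the ads and rankings exactly as they were shown, rather than relying on being able to regenerate them later.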

    September 23, 2011

    Some notes on Hadoop (mainly) and appliances

    1. EMC Greenplum has evolved its appliance product line. As I read that, the latest announcement boils down to saying that you can neatly network together various Greenplum appliances in quarter-rack increments. If you take a quarter rack each of four different things, then Greenplum says “Hooray! Our appliance is all-in-one!” Big whoop.

    2. That said, the Hadoop part of EMC’s story is based on MapR, which so far as I can tell is actually a pretty good Hadoop implementation. More precisely, MapR makes strong claims about performance and so on, and Apache Hadoop folks don’t reply “MapR is full of &#$!” Rather, they say “We’re going to close the gap with MapR a lot faster than the MapR folks like to think — and by the way, guys, thanks for the butt-kick.” A lot more precision about MapR may be found in this M. C. Srivas SlideShare.

    3. On its latest earnings call, Oracle clearly said it would introduce a Hadoop appliance, versus just hinting at a Hadoop appliance the prior quarter. The money quote was: … Read more

    October 22, 2010

    Notes and links October 22, 2010

    A number of recent posts have had good comments. This time, I won’t call them out individually.

    Evidently Mike Olson of Cloudera is still telling the machine-generated data story, exactly as he should be. The Information Arbitrage/IA Ventures folks said something similar, focusing specifically on “sensor data” …

    … and, even better, went on to say: … Read more

    October 6, 2010

    eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more

    I chatted with Oliver Ratzesberger of eBay around a Stanford picnic table yesterday (the XLDB 4 conference is being held at Jacek Becla’s home base of SLAC, which used to stand for “Stanford Linear Accelerator Center”). Todd Walter of Teradata also sat in on the latter part of the conversation. Things I learned included: … Read more

    July 31, 2010

    Nested data structures keep coming up, especially for log files

    Nested data structures have come up several times now, almost always in the context of log files.

    I don’t have a grasp yet on what exactly is happening here, but it’s something.

    June 30, 2010

    Cloudera Enterprise and Hadoop evolution

    I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I’d say: … Read more

