It’s All Getting Data Driven, Dudes!

This is about science, but it applies to all professional work. Yes, including marketing and business, of course.

Takeaways

  • Perhaps all science wants to be data-driven but could not be, until now.
  • Despite our familiarity with the aphorism “information wants to be free,” the story of Brahe, Kepler, and the Rudolphine Tables reminds us that freedom occurs only after a great deal of work, complexity, and expense.
  • Information may want to be free, but data is not as driven. In fact, data is pretty much a slacker: it might have a lot of potential, but it’s not going anywhere unless you stay on top of it.
  • ….The reality is that science is becoming data-driven at a scale previously unimagined. The ubiquity of access and the volume of data will fundamentally transform the scientific process.

    Data-driven science is not necessarily new: a compelling argument can be made that the astronomer Tycho Brahe and his assistant Johannes Kepler were doing data-driven science, at least by the scale of their time. Kepler published the Rudolphine Tables in 1627, some twenty-six years after Brahe’s death. The tables were a catalog of stars and planets and were largely based on Brahe’s observations, which were considered to be the most accurate and detailed of the time. The Rudolphine Tables formed the core of the data that Kepler used to derive his laws of planetary motion. That the Rudolphine Tables were published at all is amazing: significant infrastructure costs (in the form of purpose-built observatories), professional jealousies, intellectual property restrictions, and political and religious instabilities dominate this story. The cost, the scale, and the legal and social concerns involved in the story of the Rudolphine Tables make it the Google Book project of the seventeenth century.

    What Drives Advancement?
    Data is the engine that drives all scientific paradigms. The scientific paradigms can be differentiated by the amount of data they produce and consume:

    Theory: The primary scientific paradigm, requiring little in the way of resources or data to construct models
    Experimentation: The use of apparatus, artifacts, and observation to test theories and construct models
    Computation: Arguably a specialization of experimentation, with the tools focused around the unique opportunities provided by numerical techniques afforded by computers

    At each level, increasing amounts of data are required. It could be argued that more data makes each successive level possible (e.g., from theory to experimentation), or it could be argued that a significant-enough change in the volume and kind of data warrants its own description (e.g., computation can be seen as a form of experimentation). The existence of volumes of data alone does not constitute science, and although I cannot imagine a use of data that does not fit into one of the three categories, that does not mean that a new use does not exist.

    Challenges

    Rather than debate the classification of this phenomenon, I think it is more profitable to focus on the challenges presented by this new scale of data-driven science.

    The definition and the dynamics of the scientific artifact are changing. The scholarly communication process is optimized for information artifacts of a certain size and description… While the scientific process is becoming more data-driven, the scholarly communication process, even though largely automated, continues much as it has for hundreds of years. We must account for the increasing amount of scientific data and associated artifacts that go uncollected by the current communication process.

    Information may want to be free, but data is not as driven. In fact, data is pretty much a slacker: it might have a lot of potential, but it’s not going anywhere unless you stay on top of it.

    As the type and the scale of the data increase, the difficulty in preserving and understanding it also increases: data sets masquerading as books and source code frozen in appendices of journals are insufficient to support data-driven science as it is today.

    In the Archive Ingest and Handling Test (http://www.digitalpreservation.gov/partners/aiht/aiht.html), sponsored by the Library of Congress, I was part of one of four teams tasked with “preserving” a medium-sized website and with exchanging our archive with another project participant after one year. The sobering reality was that once the website had been processed for “archiving,” the exchange of the content was very difficult and required significant manual intervention, despite the level of coordination between project members, the short duration of the project, and the fact that three of the four participants used the same XML encoding scheme (Metadata Encoding and Transmission Standard, or METS).

    The division between code and data is somewhat artificial (not unlike the “data vs. metadata” distinction made in web-based information retrieval), and to focus solely on one without the other is myopic.

    Who will capture this data, and where will it live? Not only are the nature and the size of the science artifacts changing, but the manner in which they are acquired and stored is changing too.

    I don’t know or care if data-driven science is a new paradigm. What I do care about is the data itself: where it will come from and how it will be stored and preserved. Web-scale collections of data will drive new innovations in science. Perhaps all science wants to be data-driven but could not be, until now. Despite our familiarity with the aphorism “information wants to be free,” the story of Brahe, Kepler, and the Rudolphine Tables reminds us that freedom occurs only after a great deal of work, complexity, and expense.

    Data-Driven Science: A New Paradigm?
    © 2009 Michael L. Nelson. The text of this article is licensed under the Creative Commons Attribution-Share Alike 3.0 License (http://creativecommons.org/licenses/by-sa/3.0/).

    EDUCAUSE Review, vol. 44, no. 4 (July/August 2009): 6–7

    Michael L. Nelson

    Michael L. Nelson (mln@cs.odu.edu) is an associate professor of computer science at Old Dominion University. Before joining ODU, he spent eleven years at NASA Langley Research Center.
