Saturday, September 17, 2016

Statistics in Science – Embracing the Means-to-an-end Numbers Game


Recently, my wife and I met my boss for dinner to chat career prospects. The conversation was mild (like the curry, and no complaints on either front!) and thought-provoking. As we walked home dazed and enlightened in equal measure, something lingered in my mind. While on the subject of research interests, Pammi had been asked to elaborate a bit on her analytical skills. What had me thinking was not the subject of the question: it is commonplace for PIs to try to gauge the skill sets that budding researchers already possess, or wish to acquire. Rather, I tried to wrap my mind around how the question had been phrased. In the context of discussing potentially interesting research projects, it had been presented as though statistical knowledge, rather than being merely a set of skills and tools for unraveling biologically interesting phenomena, ought to be a stand-alone research interest all by itself.

The very word “Statistics” frequently, and I think somewhat prematurely, sounds intimidating to an unconditioned ear. Even as I finish writing this sentence, I smother a smile – it is somewhat illogical to blindly categorize something completely novel as “intimidating”. Yet people frequently foster preconceived notions; it is one of the things that make us all too human. Keeping this in mind, the agenda here is neither to elaborate on the power of specific statistical methods like linear modeling or the chi-squared test, nor to riddle this piece with scatterplots, coefficients, and levels of significance. Rather, it is to assess conceptually what makes statistics scientifically unique and thereby dispel some of the trepidations or misconceptions that may set in before an amateur researcher embarks on the statistical voyage…

In America, Asian students are often stereotyped as being both obsessed with numbers and sensational at them. “You’ll get an A in Stats because you’re from India!” This is one of the biggest myths to come out of attending graduate school in the US (aside from the idea that it is a necessary path to take to pay off student loans!), for at least three reasons. First, such statements are almost never backed up by verifiable accounts or, ironically enough, by statistical data. Second, there are plenty of kids in India who, left to themselves, would give up a good chunk of their math tuition time in exchange for more hours on Facebook or Twitter. The structure of India’s education system is such that kids are overworked, and “learning” math functions as a survival effort that fizzles out once a secure job is procured. Third, although it is fundamentally based on numbers, applying statistical tests on a day-to-day basis is somewhat different from what is surgically implanted into our brains in school as ‘Math’.

Picking up on this last point provides the perfect basis to ask whether statistics functions more as a tool than as a stand-alone science. To better assess this, it is beneficial to first recap the scientific approach. Scientific endeavor, irrespective of discipline, adheres for the most part to a single, stepwise paradigm: (1) pick an interesting phenomenon, (2) generate a set of hypotheses and make specific predictions under each, (3) gather data (whether experimental or natural) to test these, (4) analyze these data, and (5) draw conclusions and speculations from the results. So the ‘tool’ rather than ‘stand-alone science’ argument stems from the idea that the field (Statistics) constitutes but one step (4) of this paradigm, rather than the entire paradigm being applicable to the field! In other words, before choosing the right kind of test or model (4), it is imperative to know what is to be tested (1-3)! This may largely depend on the skill with which facts are handled to generate questions, or testable hypotheses.

Yet scientific phenomena are seldom set in stone. For instance, new postulates are constantly being added even to well-established theories like natural selection, based on exploratory studies of hitherto poorly understood systems. To better understand such novel systems, researchers have to abandon the stepwise paradigm, and ‘boldly go where no one has gone before’. So they resort to throwing the kitchen sink at the problem, that is, assessing the impact of several possible factors that might affect an outcome, rather than designing and testing specific statistical equations (or models) that are constructed based on informed hypotheses. For instance, say that a completely novel infectious pathogen has recently been isolated from some animals within a wildlife population. Without any prior information on the pathogen, it is impossible to construct specific hypotheses. Instead, investigators would have to ‘explore’ a wide variety of possibilities – environmental infection, types of contact with other animals, potential insect carriers or transmitters, to name a few – that may have influenced why some individuals were infected and others were not. In such cases, statistical analysis, rather than being informed by hypothesis construction, actually facilitates subsequent hypothesis construction that may be more generally applicable to similar types of pathogens and wildlife populations in the future. But does that swing the pendulum back in the direction of statistics holding its ground as a unique science? Only if the definition “Statistics: the science of throwing the kitchen sink” has a serious ring to it! Here statistics ventures into the realm of art. Still functioning as a tool, statistical knowledge in such contexts additionally involves the skill with which a researcher decides how best to represent the effects of an entire suite of characteristics that may affect a desired outcome. Diagnostic plots, graphs, and charts that help better visualize what is going on will precede mathematical equations and models; the latter are constructed much later, after some basic knowledge about the system has been gleaned.

In both the above-described approaches, the tool-rather-than-stand-alone-science argument does have a strong case. Yet rather than just solving equations or manipulating numbers like mathematicians, statisticians have to, at some juncture, take a couple of steps back to formulate the equations themselves. So concluding that statistical knowledge is more of a tool does not in any way diminish its significance (pun intended). This set of tools is unique in that it doesn’t merely involve being proficient with numbers, possessing a strong background in mathematics, or being skilled at coding in Python or R. It is, in equal if not greater measure, about possessing adequate scientific knowledge about a study system, the skill of hypothesis construction, and the art of handling scientific novelty. Just as it is difficult to see statistics standing on its own without testable scientific phenomena, doing good science is also inextricably linked to embracing the means-to-an-end numbers game.