Big data, little value
Data will be useful only when it is trustworthy
They say the best part of any model is not what comes out, but what goes in. Garbage in, garbage out, goes the old adage. These days, everyone has predictions. There are quite a few in Pakistan too, though it remains unclear what mathematics, epidemiological phenomena or theory underpins them. Big data, data analytics and deep learning are among the many bandwagons people have jumped on, though few recognise what is inside the black box of these models.
A model is a simplified representation of a system, used to predict what might happen. No model is perfect or complete. Better models are built upon fundamental principles and a sound theoretical basis, and they capture the nuances of a system. Their inaccuracies can be explained by recognising that we do not fully understand all parts of the system.
The flashy ones, however, often look at the past and try to predict the future. They can be very powerful if one has lots of high-quality data and understands how a system may behave under certain circumstances. These models have become quite popular, not just in Pakistan (where there is a fixation with buzzwords), but elsewhere too. They have also been questioned repeatedly by experts. A prominent example is the Institute for Health Metrics and Evaluation (IHME) model, which has often been touted by the White House and continues to revise its predictions without much basis. There are legitimate questions about its underpinnings, its daily variations and why it continues to swing so wildly. For Massachusetts (the state where I live), the model’s median expected deaths by August 4 went from a couple of thousand in late March to 8,250 by early April, and as of Sunday that number had fallen again to 3,236. Little has changed in terms of real policy in the state, and the wild swings are exposing the model’s structural issues. Other states have seen similarly wild projections.
The problem with such big data models in Pakistan is far more acute. First, the total number of tests conducted in the country is minuscule: fewer than 100,000 in a country of 210 million (the US has conducted nearly four million, and even that is not enough). Second, tests globally have a high false negative rate (as much as 30%), meaning that as many as three in ten infected people may test negative. There are other peculiarities. For example, according to NIH data, nearly 80% of the people who have tested positive in Pakistan are men. While there are slight differences globally, there is no epidemiological or biological reason for men to be four times as likely as women to catch the disease. Such a bizarre proportion should be met with deep suspicion about testing capacity and the undercounting of infected women. Similar numbers among the deceased are equally troubling. We should worry if our models are based on low test numbers and high false negative rates, and show that nearly 80% of all positive cases are men.
The data issues are really serious, but there are other challenges unique to Pakistan. The ulema’s position regarding congregational prayers, the mixed messages about what constitutes an essential business (e.g. barbers and tailors) and other local realities should factor into serious future projections.
So what should we do? For starters, stop using meaningless buzzwords. Second, think deeply about systematic bias in the data and why women are being undercounted. Third, recognise that we need to have our own models, rooted in transparency, science and our own context, not ones copied from elsewhere. Fourth, and most importantly, test, test, test.
Data will be useful only when it is trustworthy. Until then, big data and our fixation on modelling the future will only lead to big mistakes.
Published in The Express Tribune, April 21st, 2020.