Joao Pita Costa 2015
Influenzanet is a system that monitors the activity of influenza-like illness (ILI) with the aid of volunteers via the internet. It has been operational for more than 10 years, and at the EU level since 2008. In contrast with the traditional system of sentinel networks of mainly primary care physicians, Influenzanet obtains its data directly from the population. This creates a fast and flexible monitoring system whose uniformity allows for direct comparison of ILI rates between countries.
Persistent homology is a central tool in topological data analysis, which examines the structure of data through its topological features. In recent years it has taken on an important role in the development of medicine. It is an area of mathematics concerned with identifying global structure by inferring high-dimensional structure from low-dimensional representations, and with studying properties of an often continuous space through the analysis of a discrete sample of it, assembling discrete points into global structure. The basic technique can be extended in many different directions, permitting the encoding of topological features by barcodes and the corresponding persistence diagrams.
Using persistence, we are able to analyze the Influenzanet data and identify several topological features relevant to the epidemiological study. In particular, we can identify noise in the data, distinguish higher-dimensional features and look at join spaces between countries. This is done in terms of both the overall structure of a disease and its evolution. Finally, persistence provides a way to test agreement at a global scale arising from standard local models.
We are fundamentally interested in applying the new technology of persistent homology to a topological analysis of the medical data provided by some of the members of the Influenzanet network. As a first approach we use the data of the partner in the Netherlands, the first to be active in this network, collected from the 2003/04 flu season to the 2013/14 flu season. This particular data collection has the fields 'date', 'participants' and 'ILI', where the last field is divided into four different cases of influenza-like illness, according to the collected online questionnaires. Below you can see the 3D plot of this data (left), the simplicial complex constructed over it (centre), and the corresponding persistence diagram (right).
The persistence diagrams were computed via Vietoris-Rips complexes using Perseus, an open source persistent homology software package. Such complexes are completely determined by their underlying 1-skeleton. That structure can be represented as a symmetric distance matrix whose entries are the pairwise distances between points in a point cloud. Perseus can compute the persistent homology of the Vietoris-Rips complexes generated from that distance matrix. The following images represent the Euclidean distance matrix (left), the Mahalanobis distance matrix (centre), and the covariance matrix for the same data (right).
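As a rough illustration of the three matrices just mentioned, the following minimal sketch computes them with NumPy. The point cloud is synthetic and merely stands in for rows of the form (date index, participants, ILI count); the function names are hypothetical, and the use of a pseudo-inverse of the sample covariance is an assumption, not necessarily the procedure followed for the Influenzanet data.

```python
import numpy as np

def euclidean_distance_matrix(points):
    """Symmetric matrix of pairwise Euclidean distances between rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mahalanobis_distance_matrix(points):
    """Pairwise Mahalanobis distances based on the sample covariance."""
    cov = np.cov(points, rowvar=False)
    # Assumption: a pseudo-inverse is used so a singular covariance
    # matrix does not make the computation fail.
    cov_inv = np.linalg.pinv(cov)
    diff = points[:, None, :] - points[None, :, :]
    squared = np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff)
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.maximum(squared, 0.0))

# Synthetic stand-in data, NOT the Influenzanet records.
rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3)) * np.array([1.0, 50.0, 5.0])

D_euclid = euclidean_distance_matrix(points)
D_mahal = mahalanobis_distance_matrix(points)
covariance = np.cov(points, rowvar=False)
```

Either distance matrix is symmetric with a zero diagonal, which is the form of input a Vietoris-Rips construction expects. The Mahalanobis version rescales each direction by the covariance, so fields measured on very different scales contribute comparably.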
We have constructed several algorithms to clean the data prior to the construction of the Vietoris-Rips complexes. The images below show the effect of those algorithms on the Mahalanobis metric: subsampling (left), merging data points that are close enough (centre), and setting the distance between adjacent data points to zero (right).
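The text does not specify how the three cleaning steps are implemented, so the following is only one plausible reading, sketched under explicit assumptions: subsampling keeps every k-th point, merging is a greedy pass that drops any point within a tolerance of an already kept point, and "adjacent" means consecutive in time order. All function names and parameters are hypothetical.

```python
import numpy as np

def subsample(points, step):
    """Keep every `step`-th point (assumed subsampling rule)."""
    return points[::step]

def merge_close_points(points, tol):
    """Greedy merge: keep a point only if it lies farther than `tol`
    from every point kept so far (assumed merging rule)."""
    kept = []
    for p in points:
        if all(np.linalg.norm(p - q) > tol for q in kept):
            kept.append(p)
    return np.array(kept)

def zero_adjacent_distances(dist):
    """Return a copy of a distance matrix in which consecutive points
    (assumed to be adjacent in time) are at distance zero."""
    out = np.array(dist, dtype=float)
    idx = np.arange(len(out) - 1)
    out[idx, idx + 1] = 0.0
    out[idx + 1, idx] = 0.0
    return out

# Small made-up example: two tight pairs of points.
cloud = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.05]])
merged = merge_close_points(cloud, tol=0.5)  # one point per pair remains
```

Note that zeroing the distance between time-adjacent points links the whole series into a single connected component at scale zero, so only features of dimension 1 and higher remain informative after that step.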
Topological data analysis addresses problems relating to nonlinear systems, large-scale data and the development of more accurate models, and has had a large impact on social media, robotics, natural image statistics and cancer research. Essentially, it applies the qualitative methods of topology to problems of machine learning, data mining and computer vision.
In particular, persistent homology is an area of mathematics concerned with identifying global structure by inferring high-dimensional structure from low-dimensional representations, and with studying properties of an often continuous space through the analysis of a discrete sample of it, assembling discrete points into global structure. When a notion of distance is considered on the space, one gets a view of the space at different scales, where small features eventually disappear. Persistence allows us to compute the homology at all scales, thereby giving us the ability to find ranges of scales where the structure of the space is stable. The figure below indicates the relation between the construction of the simplicial complex and the lifetime of its elements.
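To make the idea of features appearing and disappearing across scales concrete, here is a minimal sketch of persistence in dimension 0 only, where the features are connected components. It is not the Perseus algorithm; it is a standard union-find pass over the edges of the Vietoris-Rips filtration, under the assumed convention that an edge enters the complex when the scale reaches the distance between its endpoints. The function name `h0_barcode` is hypothetical.

```python
import math
from itertools import combinations

def h0_barcode(dist):
    """Dimension-0 barcode of a Vietoris-Rips filtration.

    Every point is born at scale 0 as its own component; a component
    dies at the length of the edge that merges it into another one.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj          # two components merge at scale d
            bars.append((0.0, d))    # one of them dies here
    bars.append((0.0, math.inf))     # the last component never dies
    return bars

# Four points on a line at positions 0, 1, 2 and 10 (made-up example).
coords = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in coords] for a in coords]
bars = h0_barcode(dist)
```

In this toy example the two short bars end at scale 1, while the bar ending at scale 8 reflects the isolated point at position 10. Between scales 1 and 8 the number of components stays at two, which is the kind of stable range the paragraph above describes.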
Techniques of persistence can be used to infer topological structure in data sets, while certain variations on the method can be applied to study aspects of the shape of point clouds. By considering all possible scales, one can infer the correct scale at which to look at the point cloud simply by looking for scales where the persistent homology is stable. Multi-scale methods thus enable the study of the topology of point clouds as a route to approximating the topological features of an unobservable geometric object that generates the samples. For these reasons, barcodes, seen as multi-scale signatures, are of great importance to the application of this research to the development of new techniques in machine learning. These barcodes can be encoded as pairs of numbers represented in the quarter plane, indicating the birth and death times of a given topological feature of the data: these are the persistence diagrams. Many of the applications in recent research require the manipulation and comparison of persistence diagrams. In the following figure we present the persistence diagrams for the original data in dimension 0 (left), dimension 1 (centre), and dimension 2 (right).
We can already see that most of the features with a finite death time live in dimension 1, while dimension 2 is mostly populated by features that live forever. The next figure presents the persistence diagrams for the subsampled data in dimension 0 (left), dimension 1 (centre), and dimension 2 (right).
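The distinction drawn above between features with a death time and features that live forever can be read off a persistence diagram mechanically. The sketch below assumes a diagram is given as a list of (birth, death) pairs with an infinite death for features that never die; the example values are invented for illustration and are not results from the Influenzanet data. The function name and the noise threshold are hypothetical.

```python
import math

def summarize_diagram(diagram, noise_threshold):
    """Split a persistence diagram into finite and essential features,
    and keep the finite ones whose lifetime exceeds a threshold."""
    finite = [(b, d) for b, d in diagram if math.isfinite(d)]
    essential = [(b, d) for b, d in diagram if not math.isfinite(d)]
    # Lifetime (death - birth) is the distance from the diagonal up to
    # a constant factor; short-lived features are treated as noise.
    significant = [(b, d) for b, d in finite if d - b > noise_threshold]
    return {
        "finite": len(finite),
        "essential": len(essential),
        "significant": significant,
    }

# Invented dimension-1 diagram, for illustration only.
diagram_dim1 = [(0.2, 0.25), (0.3, 1.4), (0.5, 0.55), (0.9, math.inf)]
summary = summarize_diagram(diagram_dim1, noise_threshold=0.1)
```

Points close to the diagonal, such as the two short-lived pairs in this invented diagram, are the usual candidates for noise, while long-lived and essential features are the ones worth interpreting.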
Topological data analysis addresses problems relating to nonlinear systems, large-scale data and the development of more accurate models, all of which contribute to high-level research. Epidemiology is a rich source of problems that focus on aspects of this nature. Moreover, persistence can provide such research with high-dimensional techniques for medical data analysis. In particular, persistence diagrams are a clear and practical tool that allows us to detect outliers and to capture the dynamics of the system. In further research we will investigate whether the exceptional points that are evident under the Mahalanobis metric are also distinct under other metrics, and learn an appropriate metric under which those outliers are close enough. We shall also use kernels and SVMs on these features, enabling a machine learning approach to this data.