The effects of incomplete protein interaction data on structural and evolutionary inferences

Background
Present protein interaction network data sets include only interactions among subsets of the proteins in an organism. Previously this has been ignored, but in principle any global network analysis that only looks at partial data may be biased. Here we demonstrate the need to consider network sampling properties explicitly and from the outset in any analysis.

Results
Here we study how properties of the yeast protein interaction network are affected by random and non-random sampling schemes using a range of different network statistics. Effects are shown to be independent of the inherent noise in protein interaction data. The effects of the incomplete nature of network data become very noticeable, especially for so-called network motifs. We also consider the effect of incomplete network data on functional and evolutionary inferences.

Conclusion
Crucially, when only small, partial network data sets are considered, bias is virtually inevitable. Given the scope of effects considered here, previous analyses may have to be carefully reassessed: ignoring the fact that present network data are incomplete will severely affect our ability to understand biological systems.

Figure 1: Average degree distributions (black circles) and empirical 95% confidence intervals (dashed red lines) obtained from 1000 random subnets of the true S. cerevisiae protein interaction network. Also shown are the degree distributions of two random subnets.
In figure 1 we show the average degree distributions (black open circles), the 97.5 and 2.5 percentiles (red dashed lines) and the actual degree distributions of two random subnets. We find that the average (also shown in part A of figure 1) does describe the degree distributions well over a broad range of degrees, especially (and unsurprisingly) for larger sampling fractions. The 95% confidence interval always broadens at higher degrees, reflecting the broad-tailed (though not scale-free [5]) nature of the degree distribution. In particular, for small values of the sampling fraction, the CIs indicate considerable variability in the tail of the degree distributions.
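The resampling procedure behind figure 1 can be sketched as follows. This is a minimal illustration, assuming uniform random sampling of nodes with probability p and taking the induced subnet; the function names, the nearest-rank percentile rule and the toy ring network are assumptions made for the example, not the code used for the figure.

```python
import math
import random
from collections import Counter


def sample_subnet(nodes, edges, p, rng):
    """Keep each node independently with probability p; return the induced subnet."""
    kept = {v for v in nodes if rng.random() < p}
    sub_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, sub_edges


def degree_distribution(kept, sub_edges):
    """Map each degree to the fraction of kept nodes that have it."""
    degree = Counter({v: 0 for v in kept})
    for u, v in sub_edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    return {k: c / len(kept) for k, c in counts.items()} if kept else {}


def percentile(sorted_vals, q):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(q / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def degree_envelope(nodes, edges, p, replicates=1000, seed=0):
    """Per degree: (mean, 2.5th percentile, 97.5th percentile) over random subnets."""
    rng = random.Random(seed)
    dists = [degree_distribution(*sample_subnet(nodes, edges, p, rng))
             for _ in range(replicates)]
    dists = [d for d in dists if d]  # drop replicates in which no node was sampled
    if not dists:
        return {}
    max_k = max(k for d in dists for k in d)
    summary = {}
    for k in range(max_k + 1):
        vals = sorted(d.get(k, 0.0) for d in dists)
        summary[k] = (sum(vals) / len(vals),
                      percentile(vals, 2.5),
                      percentile(vals, 97.5))
    return summary


# Toy network (hypothetical): a ring of 10 nodes with one chord between 0 and 5.
toy_nodes = list(range(10))
toy_edges = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5)]
toy_summary = degree_envelope(toy_nodes, toy_edges, p=0.5, replicates=200, seed=1)
```

For a real interaction network the edge list would replace the toy ring, and `replicates=1000` would match the number of subnets quoted in the caption.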
Predicting the clustering coefficient of the overall network
In uncorrelated networks it is possible to express many quantities of a network in terms of the moments of the degree distribution (see box in manuscript), and for subnets of such networks we can use Eqns. (1) and (2) in the manuscript to write down approximate expressions for quantities such as the clustering coefficients of the subnets. Here we always assume that the network and the subnetwork are sufficiently large and uncorrelated.
For the (approximate) clustering coefficient [3,2] we obtain C_p ≈ C, i.e. in an uncorrelated uniform network the clustering coefficient of a random subnet will be the same as that of the overall network. In reality, however (see figure 1 of the manuscript), we observe that the clustering coefficient in the subnets is significantly smaller than that of the true network. The decrease in C with decreasing size in the S. cerevisiae datasets is much faster than the decrease in C in ensembles of classical random graphs of the same size.
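The invariance for uncorrelated networks stated at the start of this paragraph can be made explicit with a short worked derivation. It rests on two assumptions that are not spelled out in this section: that Eqns. (1) and (2) of the manuscript are the usual binomial-thinning moments for uniform node sampling with probability p, and that the approximate clustering coefficient is the standard moment expression for large uncorrelated networks. The symbols N (number of nodes), ⟨k⟩ and ⟨k²⟩ (first two moments of the degree distribution) and the subscript p (subnet quantities) are notation chosen for this sketch.

```latex
\begin{align}
\langle k \rangle_p   &= p\,\langle k \rangle, \\
\langle k^2 \rangle_p &= p^2 \langle k^2 \rangle + p(1-p)\,\langle k \rangle, \\
C   &\approx \frac{\left(\langle k^2 \rangle - \langle k \rangle\right)^2}{N \langle k \rangle^3}, \\
C_p &\approx \frac{\left(\langle k^2 \rangle_p - \langle k \rangle_p\right)^2}{pN\,\langle k \rangle_p^3}
     = \frac{p^4 \left(\langle k^2 \rangle - \langle k \rangle\right)^2}{p^4 N \langle k \rangle^3}
     = C .
\end{align}
```

The step from the first to the second form of C_p uses ⟨k²⟩_p − ⟨k⟩_p = p²(⟨k²⟩ − ⟨k⟩), so all powers of p cancel. Under these assumptions, any observed dependence of C on p signals correlations in the real network rather than an effect of sampling alone.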
Interestingly, we found a very simple functional dependence of the clustering coefficient on the sampling fraction, p, with a single parameter γ that can be determined from the fit to the observed data; we obtain γ ≈ 9.549. This allows us to estimate the clustering coefficient of the true network (i.e. the yeast interactome defined by the experimentally accessible interactions among the approximately 6000 proteins) as Ĉ_1.0 = 0.095. The fit of this simple, single-parameter function to the observed data is very good (see figure 2).

Sampling properties of network components
In figure 3 we show how connected components, in particular the giant connected component (GCC), are affected by random sampling of nodes. For p = 10% we are approaching the phase transition where the giant component vanishes. This is clearly seen in the figure. Notice that, as p decreases, the number of components first increases and then decreases. The decrease sets in once many components contain only one node and the probability of not sampling such single nodes becomes larger than the probability of further breaking up other, larger components. Many important network properties will be influenced by the loss of the GCC; most notably this applies to non-local quantities such as average path lengths, network diameter, betweenness and closeness [1,2,4]. Note, however, that for many networks with a broad-tailed degree distribution, Eqn. (6) in the manuscript can be approximated by an expression which will tend to be small (≈ 0.04 in the context of the present yeast data), such that the GCC should persist for most present PIN datasets even if they were generated by random sampling of nodes. Deviation from the random sampling scheme will of course alter results considerably. In any realistic experimental setup we would expect to see a GCC.
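The component statistics discussed above can be computed with a few lines of code. The sketch below assumes uniform random sampling of nodes and induced subnets; the function names and the toy path network are hypothetical and serve only to illustrate how the number of components and the size of the largest component are obtained for a given p.

```python
import random
from collections import defaultdict


def component_sizes(nodes, edges):
    """Sizes of all connected components, largest first (iterative depth-first search)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    sizes = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            v = stack.pop()
            size += 1
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def sampled_components(nodes, edges, p, rng):
    """Sample nodes with probability p; return (number of components, largest size)."""
    kept = {v for v in nodes if rng.random() < p}
    sub_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    sizes = component_sizes(kept, sub_edges)
    return len(sizes), (sizes[0] if sizes else 0)


# Toy network (hypothetical): a path 0-1-2-3 plus one isolated node 4.
toy_nodes = [0, 1, 2, 3, 4]
toy_edges = [(0, 1), (1, 2), (2, 3)]
rng = random.Random(0)
for p in (1.0, 0.6, 0.2):
    n_components, largest = sampled_components(toy_nodes, toy_edges, p, rng)
    print(p, n_components, largest)
```

Isolated nodes count as components of size one here, which is what allows the number of components to fall again at small p once such single nodes are mostly left unsampled.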

Inferences from Motif-spectra
In the manuscript (figures 1C,D and 2C) we have seen that motifs, especially the most connected 4-motif, are subject to considerable variation in different instances of subnets of the same size. Perhaps more worryingly, we found that the Z-score distributions for different sampling probabilities p overlap. This is also confirmed in figure 4, where we show median Z-scores obtained from 20 replicates for five different sampling fractions p = 0.4, 0.45, 0.5, 0.55, 0.6. The lines connecting the Z-score spectra/profiles cross several times and the rank order of Z-scores follows no uniform trend.
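To make the comparison of Z-score profiles concrete, the sketch below shows how a median profile and its rank order could be computed from replicate subnets. It assumes the commonly used definition of a motif Z-score, (observed count minus mean count in randomized networks) divided by the standard deviation of the randomized counts; the null model itself is not specified in this section, and the function names and the numbers in the example are purely illustrative, not results.

```python
from statistics import mean, median, stdev


def motif_z_score(observed_count, randomized_counts):
    """Z-score of a motif count against counts from randomized networks."""
    sigma = stdev(randomized_counts)
    if sigma == 0:
        return float("nan")  # undefined when the randomized counts do not vary
    return (observed_count - mean(randomized_counts)) / sigma


def median_profile(replicate_profiles):
    """Median Z-score per motif over replicates (each replicate maps motif -> Z)."""
    motifs = replicate_profiles[0].keys()
    return {m: median(rep[m] for rep in replicate_profiles) for m in motifs}


def rank_order(profile):
    """Motif labels sorted from highest to lowest Z-score."""
    return sorted(profile, key=profile.get, reverse=True)


# Illustrative numbers only: three replicates at one sampling fraction.
replicates = [
    {"m1": 1.0, "m2": -1.0},
    {"m1": 3.0, "m2": -2.0},
    {"m1": 2.0, "m2": 0.0},
]
profile = median_profile(replicates)
order = rank_order(profile)
```

Computing `rank_order` for the median profile at each sampling fraction and comparing the resulting lists is one simple way to check whether the ordering of motifs is stable across p.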