We have explored the effects of sampling on statistical measures of protein interaction structure for different sampling schemes. Our comparison with the effects of noisy interaction data (see Figure 2) suggests that sampling and noise affect network statistics in different ways. We have therefore concentrated on the sampling effects, as noise has received considerable attention previously (see, for example, [28, 13, 29]). Previous studies of network sampling properties focused on the degree distribution [4, 30, 6]. In our analysis we confirmed the results of these earlier studies, but one aspect deserves closer scrutiny: with decreasing sampling fraction the degree distribution of the randomly sampled subnets becomes straighter and the slope of the best-fit line becomes steeper. More interestingly, we find that for a data set which had previously been classified as consisting of more reliable interactions [28], the degree distribution appears reasonably similar to that of the overall network (this can also be quantified statistically [5]), especially when compared with the randomly generated subnetwork ensemble.
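One way to see how random sampling reshapes a degree distribution is a binomial thinning argument. The sketch below assumes that every node is retained independently with probability `p`, so that a retained node of degree k keeps each of its neighbours with probability `p`. This idealisation, the function name `sampled_degree_distribution` and the toy input are illustrative assumptions and not necessarily the procedure used in the analysis.

```python
from math import comb


def sampled_degree_distribution(pk, p):
    """Degree distribution among retained nodes under independent node sampling.

    pk maps a degree k to its probability in the full network (illustrative input);
    p is the assumed probability of retaining each node.
    """
    sampled = {}
    for k, prob in pk.items():
        for j in range(k + 1):
            # A retained node of degree k keeps j of its neighbours
            # with Binomial(k, p) probability.
            weight = comb(k, j) * p**j * (1 - p) ** (k - j)
            sampled[j] = sampled.get(j, 0.0) + prob * weight
    return sampled


# Toy example: every node has degree 2 and half of the nodes are retained.
print(sampled_degree_distribution({2: 1.0}, 0.5))
```

Applying the function to an empirical degree distribution for several values of `p` gives the expected subnet distributions against which a particular data set could be compared.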

Not surprisingly, we find that the effects of sampling on other statistical measures such as clustering coefficient, betweenness and motifs are more intricate (average path lengths and diameter [1] have similarly diverse sampling properties). As statistical measures become less local, the effects of sampling become increasingly subtle. For example, BC is a non-local property, and sampling acts on it locally as well as globally: the system undergoes a structural phase transition, with the giant connected component [19, 31] breaking up as *p* decreases. Thus the fraction of pairs of nodes which are connected (belong to the same component) decreases, and an increasing fraction of nodes has a BC value of 0. On the other hand, the fraction of shortest paths passing through the connected nodes increases systematically.
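The break-up of the giant component can be illustrated with a minimal sketch that tracks the fraction of connected node pairs as nodes are removed. It assumes that *p* is the probability of retaining each node independently and uses a toy random graph rather than the interaction data; all function names and parameter values are hypothetical.

```python
import random
from collections import deque


def random_graph(n, m, rng):
    """Toy undirected graph with n nodes and m distinct random edges (illustrative)."""
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)
        edges.add((min(u, v), max(u, v)))
    adj = {u: set() for u in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def sample_nodes(adj, p, rng):
    """Keep each node independently with probability p; return the induced subnet."""
    kept = {u for u in adj if rng.random() < p}
    return {u: adj[u] & kept for u in kept}


def component_sizes(adj):
    """Sizes of all connected components, found by breadth-first search."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        sizes.append(size)
    return sizes


def connected_pair_fraction(adj):
    """Fraction of node pairs that belong to the same component."""
    n = len(adj)
    if n < 2:
        return 0.0
    pairs = sum(s * (s - 1) // 2 for s in component_sizes(adj))
    return pairs / (n * (n - 1) // 2)


rng = random.Random(0)
full = random_graph(500, 1000, rng)
for p in (1.0, 0.8, 0.6, 0.4, 0.2):
    subnet = sample_nodes(full, p, rng)
    print(p, len(subnet), round(connected_pair_fraction(subnet), 3))
```

Nodes outside the largest component can lie only on shortest paths within their own small component, so a falling fraction of connected pairs goes together with a growing share of nodes whose BC is 0.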

Motifs are local objects [11, 12, 32] but *Z*-scores are constructed using a global network-rewiring approach [33, 34]. Therefore their sampling properties are more intricate than those of subgraphs that are defined differently [35]. This dual nature of motifs – they are local objects but their significance is assessed against a globally randomized network ensemble – explains the qualitative differences in their behaviour under different sampling regimes.
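The interplay between a local count and a global null model can be made concrete with a short sketch. It uses triangles as the example subgraph and assumes that the randomised ensemble is generated by degree-preserving double edge swaps; this is a common choice but an assumption here, as are the function names and default parameters, and it is not necessarily the rewiring procedure of [33, 34].

```python
import random
import statistics


def normalise(edges):
    """Undirected simple edge list: sorted endpoint pairs, no duplicates or self-loops."""
    return sorted({tuple(sorted(e)) for e in edges if e[0] != e[1]})


def count_triangles(edges):
    """Local quantity: number of triangles in the network."""
    edges = normalise(edges)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is found once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def rewire(edges, n_swaps, rng):
    """Global randomisation: attempt n_swaps degree-preserving double edge swaps."""
    edge_list = normalise(edges)
    edge_set = set(edge_list)
    if len(edge_list) < 2:
        return edge_list
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # pick one of the two possible re-pairings at random
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Reject swaps that would create a self-loop or a duplicate edge.
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue
        edge_set.difference_update((edge_list[i], edge_list[j]))
        edge_set.update((new1, new2))
        edge_list[i], edge_list[j] = new1, new2
    return edge_list


def motif_z_score(edges, n_random=100, swaps_per_edge=10, seed=0):
    """Z-score of the triangle count against the rewired ensemble (NaN if it has no spread)."""
    rng = random.Random(seed)
    edges = normalise(edges)
    observed = count_triangles(edges)
    null = [
        count_triangles(rewire(edges, swaps_per_edge * len(edges), rng))
        for _ in range(n_random)
    ]
    mean, sd = statistics.mean(null), statistics.pstdev(null)
    return (observed - mean) / sd if sd > 0 else float("nan")
```

Because the ensemble mean and standard deviation depend on the degree sequence of the whole subnet, removing nodes changes both the observed count and the reference against which it is judged.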

In addition to the sampling properties, the present analysis makes clear that subnets of the same size can differ considerably; in particular, the more complex measures of network structure, such as motif spectra, can exhibit variances that overwhelm the mean or median statistics. This becomes particularly apparent in Figure 1C. It is partly for this reason that we have not emphasised the non-random sampling schemes more: a single value of a network statistic represents only one sample drawn from an ensemble, and for networks, sampling of nodes leads to very broad distributions of sample statistics, as would be expected for such highly correlated and structured data sets [1]. Sampling and noise affect these network statistics differently: incomplete data introduce variability as well as systematic bias, whereas noise affects almost exclusively the variance in, for example, the *Z*-scores of motifs.
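The spread of a statistic across subnets of identical size can be summarised as in the following sketch. It assumes uniform random sampling of a fixed number of nodes and uses the edge count of a toy star network as the statistic; the helper names and the toy network are hypothetical, and any other network statistic could be passed in instead.

```python
import random
import statistics


def induced_subnet(adj, nodes):
    """Subnet induced by the given nodes of an adjacency-set dictionary."""
    keep = set(nodes)
    return {u: adj[u] & keep for u in keep}


def edge_count(adj):
    """Example statistic: number of edges (each edge is stored at both endpoints)."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2


def ensemble_summary(adj, size, statistic, n_samples=200, seed=0):
    """Mean, standard deviation, minimum and maximum of a statistic over random subnets."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    values = [
        statistic(induced_subnet(adj, rng.sample(nodes, size)))
        for _ in range(n_samples)
    ]
    return statistics.mean(values), statistics.pstdev(values), min(values), max(values)


# Toy star network: hub 0 connected to five leaves.
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
print(ensemble_summary(star, 3, edge_count))
```

In the star example a subnet of three nodes has two edges if it contains the hub and none otherwise, so a single subnet says little about the ensemble from which it was drawn.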

We also compared evolutionary results previously obtained for the "complete network" with those for the randomly generated subnets. In Agrafioti *et al*. [22] only the effects of local structure (*i.e*. degree) were considered and, in light of the previous discussion, it is therefore not surprising that the central results are generally confirmed in the subnets: in particular, protein expression level correlates better than degree with protein evolutionary/substitution rate. For the non-random sampling schemes the data are biased in favour of protein abundance; the results are also confirmed, but are potentially biased somewhat against degree. In general, single-node properties of proteins are statistically conserved in the subnet, *e.g*. the protein with the highest degree will, provided it is included in the sample, tend to have the highest degree in the subnet as well. As far as biological and functional inferences are concerned, the effects of network sampling appear not to differ greatly from those of statistical missing-data problems. Thus biological studies which investigate, for example, the interplay between protein domain structure and protein interactions [23], are probably not affected. Investigating such properties across a network [36], however, may be subject to bias because of the intricacies displayed by the network sampling behaviour discussed here.