These features are discussed in detail in:
Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, Ricardo
Baeza-Yates: "Using Rank Propagation and Probabilistic Counting for Link-Based
Spam Detection". In Proceedings of the Workshop on Web Mining and Web Usage
Analysis (WebKDD). Philadelphia, USA, August 2006. ACM Press.
The list of URL identifiers of the home pages and of the pages with the maximum PageRank is here:
http://datamining.sztaki.hu/files/DiscoveryChallenge/features/v2-DiscoveryChallenge2010.homepageuid_maxpruid.csv.gz
In both cases, the second field is a zero-based index into this list of URLs:
http://datamining.sztaki.hu/files/DiscoveryChallenge/links/v2-DiscoveryChallenge2010.urls.txt.gz
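As an illustration of the zero-based lookup described above, here is a minimal Python sketch. It assumes both files have been downloaded locally, that the URL list holds one URL per line, and that the CSV is comma-separated with the URL identifier in its second field. The function names load_urls and resolve_second_field are hypothetical and are not part of the dataset distribution.

```python
import csv
import gzip


def load_urls(path):
    """Read the gzipped URL list; line i (zero-based) is assumed to hold URL i."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\r\n") for line in fh]


def resolve_second_field(csv_path, urls):
    """Yield (first_field, url) per row, treating field 2 as a zero-based index."""
    with gzip.open(csv_path, "rt", encoding="utf-8", newline="") as fh:
        for row in csv.reader(fh):
            # Assumption: rows whose second field is not a plain integer
            # (for example a header line) are skipped.
            if len(row) < 2 or not row[1].strip().isdigit():
                continue
            yield row[0], urls[int(row[1])]
```

Loading the whole URL list into memory keeps the lookup simple; for a very large list, streaming only the needed line numbers may be preferable.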
==========================================================================
hostid
Identifier of the host in the hostgraph
hostname
Name of the host, including the port number if different from the default (80). Note that some hosts have more than one port open.
eq_hp_mp
Is the home page the page with the maximum PageRank in the host? 0=no 1=yes
assortativity_hp
Assortativity coefficient of the home page (degree / average degree of neighbors). Degree in this case is undirected (in_degree+out_degree)
assortativity_mp
Assortativity coefficient of the page with the maximum PageRank
avgin_of_out_hp
Average in-degree of out-neighbors of home page (hp)
avgin_of_out_mp
Average in-degree of out-neighbors of page with maximum PageRank (mp)
avgout_of_in_hp
Average out-degree of in-neighbors of hp
avgout_of_in_mp
Average out-degree of in-neighbors of mp
indegree_hp
In-degree of hp
indegree_mp
In-degree of mp
neighbors_2_hp
Neighbors at distance 2 of hp
neighbors_2_mp
Neighbors at distance 2 of mp
neighbors_3_hp
Neighbors at distance 3 of hp
neighbors_3_mp
Neighbors at distance 3 of mp
neighbors_4_hp
Neighbors at distance 4 of hp
neighbors_4_mp
Neighbors at distance 4 of mp
outdegree_hp
Out-degree of hp
outdegree_mp
Out-degree of mp
pagerank_hp
PageRank of hp (calculated in the doc graph with no self-loops, using a damping factor of 0.85, with 50 iterations)
pagerank_mp
PageRank of mp
prsigma_hp
Standard deviation of the PageRank of in-neighbors of hp
prsigma_mp
Standard deviation of the PageRank of in-neighbors of mp
reciprocity_hp
Fraction of out-links that are also in-links of hp. For instance, if the hp
has 5 out-links, and 3 of those pages link back to the home page, the
reciprocity is 3/5. A page with no out-links has a reciprocity of 0.
reciprocity_mp
Fraction of out-links that are also in-links of mp
siteneighbors_1_hp
Number of different hosts pointing to hp, obtained with an approximate algorithm (an exact count was possible, but the approximate algorithm was used)
siteneighbors_1_mp
Number of different hosts pointing to mp
siteneighbors_2_hp
Number of different hosts (approx.) supporting at distance 2 the hp
siteneighbors_2_mp
Number of different hosts (approx.) supporting at distance 2 the mp
siteneighbors_3_hp
Number of different hosts (approx.) supporting at distance 3 the hp
siteneighbors_3_mp
Number of different hosts (approx.) supporting at distance 3 the mp
siteneighbors_4_hp
Number of different hosts (approx.) supporting at distance 4 the hp
siteneighbors_4_mp
Number of different hosts (approx.) supporting at distance 4 the mp
truncated_pagerank_1_hp
TruncatedPageRank using truncation distance 1, hp
truncated_pagerank_1_mp
TruncatedPageRank using truncation distance 1, mp
truncated_pagerank_2_hp
TruncatedPageRank using truncation distance 2, hp
truncated_pagerank_2_mp
TruncatedPageRank using truncation distance 2, mp
truncated_pagerank_3_hp
TruncatedPageRank using truncation distance 3, hp
truncated_pagerank_3_mp
TruncatedPageRank using truncation distance 3, mp
truncated_pagerank_4_hp
TruncatedPageRank using truncation distance 4, hp
truncated_pagerank_4_mp
TruncatedPageRank using truncation distance 4, mp
trustrank_hp
TrustRank of hp (obtained using 2,459 hosts from ODP as the trusted set). The list of URL identifiers used is at http://datamining.sztaki.hu/files/DiscoveryChallenge/features/v2-DiscoveryChallenge2010.odp_docid.csv.gz NOTE: this feature can be improved by using more ODP hosts in the seed set.
trustrank_mp
TrustRank of mp
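
The degree-based definitions above (in-degree, out-degree, assortativity, avgin_of_out, avgout_of_in, reciprocity) can be illustrated on a toy graph. The Python sketch below is not the code used to produce the dataset. It assumes that duplicate links and self-loops are ignored, that the "neighbors" in the assortativity coefficient are the union of in-neighbors and out-neighbors, and that averages over an empty set are 0. The helper names build_adjacency, mean and page_features are hypothetical.

```python
def build_adjacency(edges):
    """Return (out_nb, in_nb): dicts mapping each page to a set of neighbors."""
    out_nb, in_nb = {}, {}
    for src, dst in edges:
        if src == dst:
            continue  # assumption: self-loops are ignored
        out_nb.setdefault(src, set()).add(dst)
        in_nb.setdefault(dst, set()).add(src)
    return out_nb, in_nb


def mean(values):
    """Arithmetic mean; 0.0 for an empty sequence (assumption)."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def page_features(page, out_nb, in_nb):
    """Compute the degree-based features of one page from the adjacency dicts."""
    outs = out_nb.get(page, set())
    ins = in_nb.get(page, set())

    def indeg(p):
        return len(in_nb.get(p, ()))

    def outdeg(p):
        return len(out_nb.get(p, ()))

    def deg(p):
        # Undirected degree, as in the description: in_degree + out_degree.
        return indeg(p) + outdeg(p)

    avg_nb_degree = mean(deg(q) for q in outs | ins)
    return {
        "indegree": len(ins),
        "outdegree": len(outs),
        # degree / average degree of neighbors
        "assortativity": deg(page) / avg_nb_degree if avg_nb_degree else 0.0,
        "avgin_of_out": mean(indeg(q) for q in outs),
        "avgout_of_in": mean(outdeg(q) for q in ins),
        # fraction of out-links that are also in-links; 0 with no out-links
        "reciprocity": len(outs & ins) / len(outs) if outs else 0.0,
    }


# Toy graph matching the reciprocity example: hp links to five pages,
# and three of them link back.
edges = [("hp", x) for x in "abcde"] + [(x, "hp") for x in "abc"]
out_nb, in_nb = build_adjacency(edges)
features = page_features("hp", out_nb, in_nb)
```

On this toy graph, features["reciprocity"] reproduces the 3/5 value from the reciprocity_hp description.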