Fix gephi export limit #464

Merged 3 commits on Dec 17, 2024
3 changes: 3 additions & 0 deletions CHANGES.md
@@ -4,8 +4,11 @@
UNRELEASED
-----
Upgraded solr dependencies from v9.1.0 to v9.4.1
HTML pages with a geo tag will no longer be found in image GEO search.
Fixed Gephi export regression bug: not all results were extracted because the Gephi export was also capped by the CSV export limit size in the property file.
Added SolrWayback ASCII logo to the log file when started successfully.
Added support for the Memento API, including timegates and timemaps. Memento properties added to solrwayback.properties. (Thanks @VictorHarbo)


5.1.2
-----
Bug fix: chunking was not removed in all cases. This was only relevant for WARC files created with chunking (not Heritrix).
@@ -566,9 +566,25 @@ public static InputStream exportWarcStreaming(boolean expandResources, boolean e
return new StreamingSolrWarcExportBufferedInputStream(solrDocs, max, gzip); // Use maximum export results from property-file
}

/**
 * <p>
 * The query will have the filter: content_type_norm:html AND links_domains:* AND url_type:slashpage
 * </p>
 * <p>
 * The same domains will appear many times in the Solr result set, but only the first occurrence will be
 * added to the CSV file. The extraction uses a HashMap to remember which domains have already been added.
 * </p>
 * <p>
 * The 100M limit on Solr documents will most likely result in a final CSV file with fewer than 1M documents,
 * which should be enough since Gephi cannot handle more than 1M nodes.
 * Split the extraction by crawl_year if the 100M limit is not enough.
 * </p>
 *
 * @param q the query
 */
public static InputStream exportLinkGraphStreaming(String q) {
SolrStreamingLinkGraphCSVExportClient solr = SolrStreamingLinkGraphCSVExportClient.createExporter(null, q);
- return new StreamingSolrExportBufferedInputStream(solr, 1000000); // 1 MIL
+ return new StreamingSolrExportBufferedInputStream(solr, 100000000); // 100M limit; the CSV streaming extractor needs a limit.
}
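
An aside for reviewers: the deduplication behaviour described in the Javadoc can be illustrated with a short sketch. All names below (LinkGraphDedupSketch, DomainDoc, export) are hypothetical; the actual streaming logic lives in SolrStreamingLinkGraphCSVExportClient and StreamingSolrExportBufferedInputStream.

import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the dedup idea from the Javadoc above: stream documents,
// emit a CSV row only for the first occurrence of each domain, and stop at the
// hard cap. All names here are hypothetical; this is not the SolrWayback API.
public class LinkGraphDedupSketch {

    private static final int MAX_DOCS = 100_000_000; // mirrors the 100M limit set in the diff above

    // Represents one Solr document with its domain and outgoing link domains.
    record DomainDoc(String domain, List<String> linksDomains) {}

    static void export(Iterable<DomainDoc> solrDocs, Appendable out) throws IOException {
        Set<String> seen = new HashSet<>(); // remembers which domains were already written
        int processed = 0;
        for (DomainDoc doc : solrDocs) {
            if (processed++ >= MAX_DOCS) {
                break; // the streaming extractor needs an upper bound
            }
            if (seen.add(doc.domain())) { // add() returns true only for the first occurrence
                out.append(doc.domain()).append(',')
                   .append(String.join(" ", doc.linksDomains()))
                   .append('\n');
            }
        }
    }
}

Splitting by crawl_year, as the Javadoc suggests, would then amount to running the export once per year with an extra filter such as crawl_year:2015 appended to q.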

@@ -87,6 +87,7 @@ public void contextInitialized(ServletContextEvent event) {
log.info("Using default warc-file-resolver implementation");
}

log.info(SolrWaybackAsciiLogo.SOLRWAYBACK_LOGO); // Log the ASCII logo when the application has started successfully.
log.info("solrwayback version " + version + " started successfully");

} catch (Exception e) {
@@ -0,0 +1,21 @@
package dk.kb.netarchivesuite.solrwayback.listeners;

public class SolrWaybackAsciiLogo {

// Some characters are escaped in the source; the logo prints correctly.
public static final String SOLRWAYBACK_LOGO =
"\n"
+ " _______. ______ __ .______ ____ __ ____ ___ ____ ____ .______ ___ ______ __ ___ \n"
+ " / | / __ \\ | | | _ \\ \\ \\ / \\ / / / \\ \\ \\ / / | _ \\ / \\ / || |/ / \n"
+ " | (----`| | | | | | | |_) | \\ \\/ \\/ / / ^ \\ \\ \\/ / | |_) | / ^ \\ | ,----'| ' / \n"
+ " \\ \\ | | | | | | | / \\ / / /_\\ \\ \\_ _/ | _ < / /_\\ \\ | | | < \n"
+ " .----) | | `--' | | `----.| |\\ \\----. \\ /\\ / / _____ \\ | | | |_) | / _____ \\ | `----.| . \\ \n"
+ " |_______/ \\______/ |_______|| _| `._____| \\__/ \\__/ /__/ \\__\\ |__| |______/ /__/ \\__\\ \\______||__|\\__\\"
+ "\n";



public static void main(String[] args) {
System.out.println(SOLRWAYBACK_LOGO);
}
}
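
On the escaping note above: a one-line, hypothetical illustration of why the source looks different from the printed banner.

System.out.println("\\__/"); // prints \__/ : each "\\" pair in the source is a single backslash in the output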