SolrNet supports Solr "extract" feature (a.k.a. Solr "Cell") to index data from binary document formats such as Word, PDF, etc.
Here's a simple example showing how to extract text from a PDF file, without indexing it:
ISolrOperations<Something> solr = ...
using (var file = File.OpenRead(@"test.pdf")) {
var response = solr.Extract(new ExtractParameters(file, "some_document_id") {
ExtractOnly = true,
ExtractFormat = ExtractFormat.Text,
});
Console.WriteLine(response.Content);
}
ExtractOnly = true
tells Solr to just perform text extraction but not index the uploaded document.
If ExtractOnly = false
you can add more fields with the Fields
property.
Other options can be set through the properties of the ExtractParameters
class.
It's usually recommended to provide the StreamType
for the content, as auto-detection might fail.
For more details about each option in ExtractParameters
see the Solr wiki and the Solr reference guide.