Reference paper: http://ceur-ws.org/Vol-2645/paper3.pdf
Presentation: cs19mds11002_privacy_summarization.pptx (see it for details)
i. Use the ToS;DR API (https://tosdr.org/api) to get the quoted text and its label for each service's privacy policy.
ii. Use only approved cases.
iii. Match the ToS;DR quoted text to the quoted text of the publicly available dataset used in the paper, keeping pairs with a similarity score > 80%.
iv. Generate a partial training dataset from the matched texts.
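A minimal sketch of steps ii to iv, assuming the similarity score is a character-level ratio from Python's difflib (the notes do not say which metric is used). The dictionary keys "quote", "label" and "approved" are hypothetical placeholders, not the actual ToS;DR API response fields.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.80  # the "> 80%" cut-off from step iii


def normalize(text):
    """Lower-case and collapse whitespace so formatting does not affect the score."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a 0..1 ratio (difflib is an assumed stand-in for the real metric)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_quotes(tosdr_cases, dataset_quotes, threshold=SIMILARITY_THRESHOLD):
    """Pair each approved quote with its best dataset quote above the threshold."""
    matched = []
    for case in tosdr_cases:
        if not case["approved"]:
            continue  # step ii: use only approved cases
        best = max(dataset_quotes, key=lambda q: similarity(case["quote"], q), default=None)
        if best is not None and similarity(case["quote"], best) > threshold:
            matched.append({"text": best, "label": case["label"]})
    return matched


# Toy data (hypothetical), standing in for API results and the public dataset.
cases = [
    {"quote": "We may share your personal data with third parties.", "label": "bad", "approved": True},
    {"quote": "You can delete your account at any time.", "label": "good", "approved": False},
]
public = [
    "We may share your personal data with third parties",
    "Cookies are used to remember your preferences.",
]
partial_training_set = match_quotes(cases, public)
```

Taking only the single best match per quote keeps one labeled row per ToS;DR case; a different policy (for example keeping every match above the threshold) would be an equally reasonable choice.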
i. For each service in the training set, download the actual HTML text of its privacy pages.
ii. Set aside the policies of some services as held-out test data (held_out_test_data).
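The two steps above could look roughly like the sketch below. It assumes the HTML has already been downloaded, extracts visible text with the standard library's HTMLParser, and holds out whole services rather than individual sentences. The 20% held-out fraction, the seed and the service names are illustrative assumptions, not values from the paper.

```python
import random
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML page as one space-joined string."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def split_services(services, held_out_fraction=0.2, seed=0):
    """Split whole services into (train, held-out test) lists, reproducibly."""
    shuffled = sorted(services)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_test:], shuffled[:n_test]


page = ("<html><head><style>p{}</style></head><body><p>We collect data.</p>"
        "<script>var x=1;</script><p>We share it.</p></body></html>")
text = html_to_text(page)

services = ["service_a", "service_b", "service_c", "service_d", "service_e"]
train_services, held_out_test_data = split_services(services)
```

Splitting by service keeps every sentence of a held-out policy out of training, so the test set measures how well the model handles unseen policies.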
i. Use the downloaded policy text from step 2 and the partial training set from step 1.
ii. Compare the two and augment the training set with additional neutral sentences from the actual source.
iii. Create the final labeled training dataset.
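One possible reading of this augmentation step, as a sketch: every policy sentence that does not closely match an already-labeled text is added with a neutral label. The regex sentence splitter, the reuse of the 0.80 threshold and the label name "neutral" are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_labeled(sentence, labeled_texts, threshold=0.80):
    """True if the sentence closely matches any already-labeled text."""
    s = sentence.lower()
    return any(SequenceMatcher(None, s, t.lower()).ratio() > threshold for t in labeled_texts)


def build_final_dataset(policy_text, partial_training_set, neutral_label="neutral"):
    """Keep the labeled rows and append unmatched policy sentences as neutral rows."""
    labeled_texts = [row["text"] for row in partial_training_set]
    final = list(partial_training_set)
    for sentence in split_sentences(policy_text):
        if not is_labeled(sentence, labeled_texts):
            final.append({"text": sentence, "label": neutral_label})
    return final


# Toy inputs (hypothetical): one downloaded policy and one labeled row.
policy = "We may share your personal data with third parties. This policy was last updated in May."
partial = [{"text": "We may share your personal data with third parties.", "label": "bad"}]
final_dataset = build_final_dataset(policy, partial)
```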
a. Use word2vec embeddings to generate a vector for each word token.
b. Create the sentence matrix as explained in the paper.
c. Neural network architecture:
CONV1D -> Max Pool -> Concat -> Dropout
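To make the data flow concrete, here is a dependency-free sketch of the forward pass for one sentence. It assumes the sentence matrix stacks one embedding row per token with zero padding, and that one filter is used per kernel width (2, 3 and 4 are assumed widths). Random vectors stand in for trained word2vec embeddings, and the final classification layer is omitted because the notes do not describe it. A real implementation would normally use a deep learning framework.

```python
import random

EMBED_DIM = 4  # toy size; stands in for the word2vec vector dimension
MAX_LEN = 6    # sentences are padded or truncated to this many tokens


def sentence_matrix(tokens, embeddings, max_len=MAX_LEN, dim=EMBED_DIM):
    """One embedding row per token; unknown words and padding become zero rows."""
    zero = [0.0] * dim
    rows = [embeddings.get(tok, zero) for tok in tokens[:max_len]]
    rows += [zero] * (max_len - len(rows))
    return rows


def conv1d(matrix, kernel, bias=0.0):
    """Slide a (width x dim) kernel along the token axis and apply ReLU."""
    width = len(kernel)
    out = []
    for start in range(len(matrix) - width + 1):
        total = bias
        for k in range(width):
            total += sum(w * x for w, x in zip(kernel[k], matrix[start + k]))
        out.append(max(0.0, total))
    return out


def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def forward(matrix, kernels, rate=0.5, rng=None, training=False):
    """CONV1D -> max pool over time -> concat -> dropout for one sentence."""
    # Max pooling keeps one value per kernel; collecting them in a list is the concat.
    pooled = [max(conv1d(matrix, kernel)) for kernel in kernels]
    return dropout(pooled, rate, rng or random.Random(0), training)


rng = random.Random(0)
vocab = ["we", "share", "your", "data"]
embeddings = {w: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for w in vocab}
kernels = [
    [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(width)]
    for width in (2, 3, 4)
]
matrix = sentence_matrix("we share your data".split(), embeddings)
features = forward(matrix, kernels)
```

Max pooling over time is what lets kernels of different widths feed one fixed-length feature vector regardless of sentence length.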
a. Train the model in batches for 20 epochs.
b. Evaluate precision, recall, and F1 score.
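For reference, the metrics in step b can be computed per label as in the sketch below, which also shows a simple batching helper for step a. The label names, the batch size and the toy predictions are illustrative assumptions, and the training step itself is not shown.

```python
EPOCHS = 20  # from step a


def batches(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels (hypothetical) to exercise the functions.
y_true = ["bad", "bad", "neutral", "good", "bad"]
y_pred = ["bad", "neutral", "neutral", "bad", "bad"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="bad")

# Number of batch updates over all epochs with a batch size of 2.
total_steps = sum(1 for _ in range(EPOCHS) for _ in batches(y_true, 2))
```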
a. The summary extractor supports both approaches explained in the paper:
Risk Focused Content Extraction
Risk Coverage Extraction
Hyperparameters:
alpha - risk threshold.
compression ratio - used to calculate the budget.
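The sketch below shows one way the two hyperparameters could interact. It is not the paper's algorithm; see the paper for the exact definitions. It assumes each sentence already has a risk score in [0, 1] and a risk category, that the budget is a sentence count equal to the compression ratio times the number of sentences, that the risk-focused mode takes the highest-scoring sentences above alpha, and that the coverage mode takes sentences round-robin across categories. All field names are hypothetical.

```python
def budget_for(sentences, compression_ratio):
    """Assumed budget: number of sentences allowed in the summary (at least 1)."""
    return max(1, int(len(sentences) * compression_ratio))


def risk_focused(sentences, alpha, compression_ratio):
    """Pick the riskiest sentences with risk >= alpha, up to the budget."""
    budget = budget_for(sentences, compression_ratio)
    risky = [s for s in sentences if s["risk"] >= alpha]
    top = sorted(risky, key=lambda s: s["risk"], reverse=True)[:budget]
    return sorted(top, key=lambda s: s["position"])  # restore document order


def risk_coverage(sentences, alpha, compression_ratio):
    """Pick risky sentences round-robin over categories, up to the budget."""
    budget = budget_for(sentences, compression_ratio)
    by_category = {}
    for s in sorted(sentences, key=lambda s: s["risk"], reverse=True):
        if s["risk"] >= alpha:
            by_category.setdefault(s["category"], []).append(s)
    chosen = []
    while len(chosen) < budget and any(by_category.values()):
        for queue in by_category.values():
            if queue and len(chosen) < budget:
                chosen.append(queue.pop(0))
    return sorted(chosen, key=lambda s: s["position"])


# Toy scored sentences (hypothetical values).
sentences = [
    {"position": 0, "text": "We sell your data to advertisers.", "risk": 0.9, "category": "sharing"},
    {"position": 1, "text": "We share data with partners.", "risk": 0.8, "category": "sharing"},
    {"position": 2, "text": "We keep logs indefinitely.", "risk": 0.7, "category": "retention"},
    {"position": 3, "text": "Contact us by email.", "risk": 0.1, "category": "other"},
]
focused = risk_focused(sentences, alpha=0.5, compression_ratio=0.5)
covered = risk_coverage(sentences, alpha=0.5, compression_ratio=0.5)
```

With this toy input the focused summary keeps the two sharing sentences, while the coverage summary trades the second one for the retention sentence so that more risk categories are represented.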