Table clipper functionality (norma-0.7.0-alpha) and supporting analysis code #73

Open
wants to merge 67 commits into
base: master
67 commits
636ed8d
fixed tests
petermr Apr 8, 2016
449c1ab
Merge pull request #24 from ContentMine/dev
petermr Apr 8, 2016
98d8199
add SVG2CSV transform
petermr Jun 8, 2017
c2a15a0
Merged pmr local master into old remote pmr/master
petermr Jun 8, 2017
b764d5a
Remove Grobid from norma
petermr Jun 10, 2017
b648537
add plot examples
petermr Jun 10, 2017
4936ffc
add pom; reinstate PlotBox
petermr Jun 10, 2017
322211c
added more examples
petermr Jun 11, 2017
bbd7ddc
modify NormaTransformer to manage figures
petermr Jun 11, 2017
8ceac7e
Added jar-with-dependencies
petermr Jun 11, 2017
927b075
add new tests for compactsvg
petermr Jul 5, 2017
d9c92a6
add test data
petermr Jul 5, 2017
b3c5f7b
add compactsvg code
petermr Jul 5, 2017
30520ee
update NormaTransformer to allow compactsvg
petermr Jul 5, 2017
f544889
update args.xml
petermr Jul 5, 2017
5a01bee
update tests
petermr Jul 5, 2017
5126103
small updates
petermr Jul 10, 2017
f9b5212
enhancements to allow rotation
petermr Jul 30, 2017
72173e0
added tests for ruled, gridded, continuation tables
petermr Aug 4, 2017
67393ee
tidied tests
petermr Aug 7, 2017
f29d18e
files moved to svg2xml
petermr Aug 9, 2017
5fae7e9
new tests for UCL in norma
petermr Sep 22, 2017
d21e1c5
changes to NormaTransformer due to upstream changes
petermr Sep 22, 2017
12e3c46
added tests
petermr Sep 22, 2017
2184e02
cleaned code after major refactor
petermr Oct 31, 2017
d6e48f3
fixed pom
petermr Oct 31, 2017
b1ba3ae
added tei and xsl tests
petermr Jan 6, 2018
725d4a1
started to tidy converter options
petermr Jan 6, 2018
f437613
fixed tests after refactoring
petermr Jan 6, 2018
72cf0f0
converted dependency to cproject
petermr Jan 6, 2018
55a287f
Sort tab labels to order by numeric suffix
jkbcm Jan 15, 2018
7051a35
Omit deb build
jkbcm Jan 15, 2018
f0ae82e
Ensure TextStructurer is present
jkbcm Jan 15, 2018
8a2df91
Remove trailing space on directory name (NTFS compatibility)
jkbcm Jan 15, 2018
00a224c
Remove trailing space in directory name (NTFS compatibility)
jkbcm Jan 15, 2018
5b00acb
Add missing dependency
jkbcm Jan 17, 2018
932c59e
Temporarily set test needing missing resources to ignore
jkbcm Jan 17, 2018
337154e
Ignore testCreateSvgHtml silently fails with fileFilter / paths mismatch
jkbcm Jan 19, 2018
3685a69
reinstating PDF transformation in norma
petermr Jan 17, 2018
9cd93a9
added clipping demo and also Norma convenience method
petermr Jan 25, 2018
4aeed81
New build dependency order
jkbcm Feb 4, 2018
eefa67a
Merges of Table Clipper (PageCropper) functionality
jkbcm Feb 4, 2018
25d7fbc
Use new styling for semantic structure. Fix tab button label order.
jkbcm Nov 9, 2017
fda10b0
Fix dependencies after cherry-pick from master (old stack)
jkbcm Feb 5, 2018
96bc84c
added page cropping; --page and --cropbox
petermr Jan 24, 2018
6c8d950
added demos bmj cert lancet
petermr Jan 25, 2018
8b96bd6
added RegionFinder
petermr Jan 25, 2018
b61de3e
added clipping demo and also Norma convenience method
petermr Jan 25, 2018
5f7024f
finished grobid-html.xsl
petermr Jan 25, 2018
94e5d05
Merge some recent changes related to ongoing XSL development
jkbcm Feb 5, 2018
f5c3b30
Add PDF.js test doc
jkbcm Feb 5, 2018
3878346
Add style for table footer
jkbcm Feb 5, 2018
678c818
Direct log4j output to new daily rotated norma.log RollingAppender
jkbcm Nov 9, 2017
b1615a5
Move Assert for fulltext-page1.svg existence to after it is now created
jkbcm Feb 5, 2018
6e9daef
Ensure build is cross-platform copy resources using UTF-8
jkbcm Feb 6, 2018
729b801
Set jdeb plugin to skip to avoid building .deb every time
jkbcm Feb 6, 2018
50d768f
Ensure command prompt is on new line after any runtime output
jkbcm Feb 13, 2018
6016d04
Output simple confirmation of files processed to user
jkbcm Nov 7, 2017
19fb831
Ensure command-line runs return the prompt to a new line.
jkbcm Feb 19, 2018
d550ccb
Remove duplicate sections from pom
jkbcm Feb 27, 2018
ebfb977
HtmlAggregate transform accepts long table numbers to support UIDs
jkbcm Feb 27, 2018
53c36e7
Reinstate testMenu with valid target directory and run following test…
jkbcm Feb 28, 2018
283a13b
Version number for release 0.6.1-alpha
jkbcm Feb 28, 2018
1f756e7
Ensure convenience methods have path regexes converted to current
Feb 28, 2018
2ba09b7
Use different exception messages for SVG file/dir creation errors.
Feb 28, 2018
68d5b78
Ensure paths and path regexes in tests are cross-platform.
Feb 28, 2018
1333537
Version number 0.7.0-alpha
jkbcm Mar 1, 2018
75 changes: 49 additions & 26 deletions pom.xml
@@ -5,18 +5,29 @@
<modelVersion>4.0.0</modelVersion>

<properties>
<norma.version>0.4.0</norma.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<!-- upstream -->
<cproject.version>0.4.0</cproject.version>
<imageanalysis.version>1.1.0</imageanalysis.version>
<!-- other -->
<opennlp.version>1.6.0</opennlp.version>
<cproject.version>0.7.0-SNAPSHOT</cproject.version>

<imageanalysis.version>1.3.0-SNAPSHOT</imageanalysis.version>
<!-- this is the new PDFBox2 version -->
<pdf2svg2.version>2.3.0-SNAPSHOT</pdf2svg2.version>
<opennlp.version>1.6.0</opennlp.version>
<xml-apis.version>1.4.01</xml-apis.version>
<Saxon-HE.version>9.6.0-3</Saxon-HE.version>
<json-path.version>2.0.0</json-path.version>
<jsoup.version>1.8.2</jsoup.version>
<xmlunit.version>1.4</xmlunit.version>
<jdeb.version>1.3</jdeb.version>



</properties>

<groupId>org.contentmine</groupId>
<artifactId>norma</artifactId>
<!-- to sync with new cproject-norma-ami versions -->
<version>${norma.version}</version>
<version>0.7.0-alpha</version>
<packaging>jar</packaging>
<name>norma</name>
<description>A Java library for processing multiple legacy formats into normalized HTML5</description>
@@ -74,6 +85,7 @@
</execution>
</executions>
<configuration>
<skip>true</skip>
<dataSet>
<data>
<src>${project.build.directory}/appassembler/</src>
@@ -135,20 +147,6 @@
<argLine>-Xmx1024m -XX:MaxPermSize=256m</argLine>
</configuration>
</plugin>

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.5</version>
<configuration>
<source>1.5</source>
<target>1.5</target>
</configuration>
</plugin>
<!-- giant jar -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
@@ -180,6 +178,13 @@
</build>

<dependencies>
<!--
<dependency>
<groupId>org.contentmine</groupId>
<artifactId>svghtml</artifactId>
<version>${svghtml.version}</version>
</dependency>
-->
<dependency>
<groupId>org.contentmine</groupId>
<artifactId>cproject</artifactId>
@@ -190,10 +195,19 @@
<artifactId>imageanalysis</artifactId>
<version>${imageanalysis.version}</version>
</dependency>
<!-- the new version -->
<!-- remove while refactoring -->
<!--
<dependency>
<groupId>org.contentmine</groupId>
<artifactId>pdf2svg</artifactId>
<version>${pdf2svg2.version}</version>
</dependency>
-->
<dependency>
<groupId>net.sf.saxon</groupId>
<artifactId>Saxon-HE</artifactId>
<version>9.6.0-3</version>
<version>${Saxon-HE.version}</version>
</dependency>
<dependency>
<groupId>org.apache.opennlp</groupId>
@@ -205,34 +219,43 @@
<dependency>
<groupId>com.jayway.jsonpath</groupId>
<artifactId>json-path</artifactId>
<version>2.0.0</version>
<version>${json-path.version}</version>
</dependency>
<!-- to avoid Xerces Hell?
http://stackoverflow.com/questions/17777821/maven-dependency-conflict-org-w3c-dom-elementtraversal
-->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.8.2</version>
<version>${jsoup.version}</version>
</dependency>
<dependency>
<groupId>xmlunit</groupId>
<artifactId>xmlunit</artifactId>
<version>1.4</version>
<version>${xmlunit.version}</version>
</dependency>
<dependency>
<groupId>org.vafer</groupId>
<artifactId>jdeb</artifactId>
<version>1.3</version>
<version>${jdeb.version}</version>
<!-- Prevents jar bloat in final package -->
<scope>provided</scope>
</dependency>


<!--
<dependency>
<groupId>org.grobid</groupId>
<artifactId>grobid-core</artifactId>
<version>0.4.1</version>
</dependency>
-->

<dependency>
<groupId>xml-apis</groupId>
<artifactId>xml-apis</artifactId>
<version>${xml-apis.version}</version>
</dependency>


</dependencies>
</project>
33 changes: 33 additions & 0 deletions src/main/java/org/xmlcml/norma/Norma.java
@@ -1,8 +1,11 @@
package org.xmlcml.norma;

import java.io.File;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.xmlcml.cproject.args.DefaultArgProcessor;
import org.xmlcml.cproject.util.Utils;

public class Norma {

@@ -16,8 +19,17 @@ public class Norma {
private DefaultArgProcessor argProcessor;

public static void main(String[] args) {
try {
Norma norma = new Norma();
norma.run(args);
} catch (Exception ex) {
LOG.trace(ex.getMessage());
LOG.trace("stack trace", ex);
System.err.println(ex.getMessage());
}
// Ensure command prompt is on a new line
// after any runtime outputs
System.out.println();
}

public void run(String[] args) {
@@ -26,11 +38,32 @@ public void run(String[] args) {
}

public void run(String args) {
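// a null argument string is treated as "no arguments" instead of causing a NullPointerException
String[] tokens = (args == null) ? new String[0] : args.trim().split("\\s+");
argProcessor = new NormaArgProcessor(tokens);
argProcessor.runAndOutput();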
}

public DefaultArgProcessor getArgProcessor() {
return argProcessor;
}

/** converts a projectDirectory of raw PDFs to a project and the PDFs to SVG
*
* @param projectDir directory containing the raw PDF files
*/
public static void convertRawPDFToProjectToSVG(File projectDir) {
// Ensure fileFilter is cross-platform compatible
String fileFilterString = Utils.convertPathRegexToCurrentPlatform(".*/(.*)\\.pdf");
new Norma().run("--project "+projectDir+" --makeProject (\\1)/fulltext.pdf --fileFilter "+fileFilterString);
new Norma().run("--project " + projectDir + " --input fulltext.pdf "+ " --outputDir " + projectDir + " --transform pdf2svg ");
}

/** converts a projectDirectory of raw TEI XML files to a project
*
* @param projectDir directory containing the raw TEI XML files
*/
public static void convertRawTEIXMLToProject(File projectDir) {
new Norma().run("--project "+projectDir+" --makeProject (\\1)/fulltext.xml --fileFilter .*\\/(.*)\\.xml");
// new Norma().run("--project " + projectDir + " --input fulltext.tei.xml "+ " --outputDir " + projectDir + " --transform tei2html ");
}
}
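
Note on the `--fileFilter` handling in `convertRawPDFToProjectToSVG`: the method writes its path regex with forward slashes and passes it through `Utils.convertPathRegexToCurrentPlatform` before use. The cproject implementation is not part of this diff, so the following is only a minimal sketch of what such a conversion might do, assuming it rewrites `/` to the platform separator and doubles a backslash so that it stays a literal in the regex. `PathRegexSketch` and `toPlatformRegex` are hypothetical names, and the separator is passed in explicitly so that both cases can be exercised on any machine.

```java
import java.util.regex.Pattern;

public class PathRegexSketch {

    /**
     * Hypothetical helper: rewrites '/' in a path regex to the given separator.
     * A backslash separator is doubled so the regex engine reads it as a literal.
     * This is an assumption about the behaviour, not the cproject implementation.
     * Limitation: a regex that already escapes the slash ("\/") is not handled.
     */
    public static String toPlatformRegex(String regex, char separator) {
        if (separator == '/') {
            return regex; // already in the form used on Unix-like systems
        }
        String literalSeparator = (separator == '\\') ? "\\\\" : String.valueOf(separator);
        // String.replace is a literal replacement, so no regex escaping rules apply here
        return regex.replace("/", literalSeparator);
    }

    public static void main(String[] args) {
        String unixRegex = ".*/(.*)\\.pdf";
        String windowsRegex = toPlatformRegex(unixRegex, '\\');
        // Each regex matches paths written with its own separator
        System.out.println(Pattern.matches(unixRegex, "project/paper1.pdf"));
        System.out.println(Pattern.matches(windowsRegex, "project\\paper1.pdf"));
    }
}
```

In real code the separator would presumably come from `File.separatorChar`. Note that `convertRawTEIXMLToProject` in this diff still passes its regex unconverted.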