Discover how a browser interacts with pdf2doc.com and reverse engineer a solution that does not require a browser.
A combination of the Chrome Developer Tools and the curl command-line utility was used to conduct this research.
-
The website uses the Plupload API to handle file uploads.
-
Uploads and progress notifications are done through AJAX - the page never reloads.
-
Upon page load, a 16-character session ID (sid) is generated using the following method:

    function randomString() {
        for (var t = "0123456789abcdefghiklmnopqrstuvwxyz", e = 16, i = "", n = 0; e > n; n++) {
            var a = Math.floor(Math.random() * t.length);
            i += t.substring(a, a + 1)
        }
        return i
    }
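Since sid is produced entirely on the client, a script can generate its own. Below is a minimal Python 3 sketch that mirrors randomString(); the names random_sid, SID_ALPHABET and SID_LENGTH are made up for illustration and are not part of the site's code.

```python
import random

# Alphabet copied verbatim from the site's randomString(); it contains no "j".
SID_ALPHABET = "0123456789abcdefghiklmnopqrstuvwxyz"
SID_LENGTH = 16


def random_sid(rng=random):
    """Return a 16-character session ID drawn uniformly from SID_ALPHABET.

    rng is any object with a choice() method (the random module by default),
    which makes the function easy to seed when testing.
    """
    return "".join(rng.choice(SID_ALPHABET) for _ in range(SID_LENGTH))


if __name__ == "__main__":
    print(random_sid())
```

Passing a seeded random.Random instance as rng yields reproducible IDs, which is handy when replaying a captured session.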
-
Before uploading, the Plupload library generates a ~30 character unique file ID (fid) using the following method:

    var guid = (function() {
        var counter = 0;
        return function(prefix) {
            var guid = new Date().getTime().toString(32), i;
            for (i = 0; i < 5; i++) {
                guid += Math.floor(Math.random() * 65535).toString(32);
            }
            return (prefix || 'o_') + guid + (counter++).toString(32);
        };
    }());
-
The upload process begins when a POST request with a Content-Type of "multipart/form-data" is sent to the /upload/<sid> endpoint. The request also contains three parameters:
    name: The filename, e.g. "Test.pdf"
    id: The file ID (fid)
    file: The file itself, in binary format.
NOTE: Although sid and fid are generated using the methods above, they are created client-side, so you can substitute your own values if you wish. fid appears to accept any value, while sid must be 16 characters long in order to be processed.
An example curl command looks like this:
curl -X POST -F "name=ID-Test.pdf" -F "id=testing" -F "file=@Test.pdf" -H "Content-Type: multipart/form-data" http://pdf2doc.com/upload/3sw4i3wpq25qm46s
The response is sent as JSON and looks like this:
{ "data": { "file": "Test.pdf", "file_size_human": "74K" }, "id": "testing", "jsonrpc": "2.0", "result": null }
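To show what the upload looks like without curl, here is a rough Python 3 sketch that assembles the multipart/form-data body by hand using only the standard library. The function names, the application/pdf part type and the random boundary are assumptions made for this example; only the endpoint, the three field names and the response shape come from the observations above.

```python
import json
import urllib.request
import uuid

BASE_URL = "http://pdf2doc.com"


def build_upload_request(sid, fid, filename, pdf_bytes):
    """Return a urllib Request for POST /upload/<sid> with name, id and file parts."""
    boundary = uuid.uuid4().hex  # any string that does not occur in the payload
    parts = []
    # Plain text fields: name (the filename) and id (the fid).
    for field, value in (("name", filename), ("id", fid)):
        parts.append(
            ('--%s\r\nContent-Disposition: form-data; name="%s"\r\n\r\n%s\r\n'
             % (boundary, field, value)).encode("utf-8")
        )
    # The file part; the application/pdf type is an assumption.
    parts.append(
        ('--%s\r\nContent-Disposition: form-data; name="file"; filename="%s"\r\n'
         'Content-Type: application/pdf\r\n\r\n' % (boundary, filename)).encode("utf-8")
    )
    parts.append(pdf_bytes)
    parts.append(("\r\n--%s--\r\n" % boundary).encode("utf-8"))
    headers = {"Content-Type": "multipart/form-data; boundary=%s" % boundary}
    return urllib.request.Request(
        "%s/upload/%s" % (BASE_URL, sid),
        data=b"".join(parts),
        headers=headers,
        method="POST",
    )


def parse_upload_response(text):
    """Return (echoed file ID, reported file name) from the JSON response."""
    payload = json.loads(text)
    return payload["id"], payload["data"]["file"]
```

Sending the request would then be a matter of passing it to urllib.request.urlopen() and feeding the decoded body to parse_upload_response(); that step is left out here because it needs the live service.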
-
Immediately after uploading, the page sends a GET request to /convert/<sid>/<fid>?rnd=<rnd>. rnd is generated using Math.random() and can be omitted from the request. I believe it simply acts as a cache-busting mechanism.
An example curl command looks like this:
curl http://pdf2doc.com/convert/3sw4i3wpq25qm46s/testing
And the response:
{"status": "success"}
I wasn't able to get a conversion to fail (I didn't really try), but failure is certainly possible, and if it happens, this response is probably where it would show up.
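A small Python 3 sketch of this step: one helper builds the /convert URL (with rnd optional, as observed), and another checks the response. Since a failed response was never observed, anything other than {"status": "success"} is simply treated as "not started"; that rule and the function names are assumptions.

```python
import json
import urllib.parse

BASE_URL = "http://pdf2doc.com"


def convert_url(sid, fid, rnd=None):
    """Build the /convert URL; rnd is only appended when given."""
    url = "%s/convert/%s/%s" % (BASE_URL, sid, fid)
    if rnd is not None:
        url += "?" + urllib.parse.urlencode({"rnd": rnd})
    return url


def conversion_started(response_text):
    """Return True only if the body is JSON with "status" equal to "success"."""
    try:
        payload = json.loads(response_text)
    except ValueError:  # not JSON at all, e.g. an HTML error page
        return False
    return isinstance(payload, dict) and payload.get("status") == "success"
```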
-
The conversion can be monitored through the /status/<sid>/<fid>?rnd=<rnd> endpoint. rnd serves the same purpose here as it did previously.
An example curl command looks like this:
curl http://pdf2doc.com/status/3sw4i3wpq25qm46s/testing
Response:
{ "fid": "testing", "progress": 0, "sid": "3sw4i3wpq25qm46s", "status": "processing", "status_text": null }
Presumably, progress changes over time to reflect how close the conversion is to being completed. In addition, the JSON format changes once the conversion is completed:
{ "convert_result": "Test.doc", "fid": "testing", "progress": 100, "savings": null, "sid": "3sw4i3wpq25qm46s", "status": "success", "thumb_url": "\/files\/3sw4i3wpq25qm46s\/testing\/thumb.png?nimg" }
convert_result is the filename of the newly converted document. thumb_url is a URI leading to a 125x77 screenshot of the converted document. The query (nimg) appears to be another randomly generated cache-busting mechanism.
NOTE: If you visit this endpoint before hitting the previous one (/convert), you will get the following error:
{ "details": "Conversion error.", "status": "error" }
This does not actually mean the conversion failed; it just means that it was never started. The conversion must be triggered manually.
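The status endpoint lends itself to a simple polling loop. The Python 3 sketch below takes the HTTP call as a parameter (fetch, a hypothetical callable that returns the response body as text) so the logic can be exercised offline. The one-second interval, the poll limit and the ConversionError name are assumptions; any status other than "success" or "error" is treated as still processing.

```python
import json
import time

BASE_URL = "http://pdf2doc.com"


class ConversionError(Exception):
    """Raised when /status reports an error or polling gives up."""


def status_url(sid, fid):
    return "%s/status/%s/%s" % (BASE_URL, sid, fid)


def wait_for_conversion(sid, fid, fetch, interval=1.0, max_polls=60, sleep=time.sleep):
    """Poll /status until it reports success and return convert_result."""
    url = status_url(sid, fid)
    for _ in range(max_polls):
        status = json.loads(fetch(url))
        state = status.get("status")
        if state == "success":
            return status["convert_result"]
        if state == "error":
            # Also what the server answers when /convert was never called.
            raise ConversionError(status.get("details", "unknown error"))
        sleep(interval)  # assumed to still be "processing"
    raise ConversionError("gave up after %d polls" % max_polls)
```

In real use, fetch could be something like `lambda url: urllib.request.urlopen(url).read().decode("utf-8")`.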
-
Finally, to download the file, the page sends a GET request to /download/<sid>/<fid>/<convert_result>?rnd=<rnd>. This link is generated in an anonymous function assigned as a click event handler:

    $("#" + data.fid + " div.plupload_file_button" + (thumbnail_clickable ? ", #" + data.fid + " .plupload_thumb" : "")).click(function() {
        downloadURI("download/" + data.sid + "/" + data.fid + "/" + data.convert_result + "?rnd=" + Math.random(), data.convert_result);
    });

And here's the source for downloadURI:

    function downloadURI(uri, name) {
        if (HTMLElement.prototype.click) {
            var link = document.createElement("a");
            link.download = name;
            link.href = uri;
            link.style.display = "none";
            document.body.appendChild(link);
            link.click();
            setTimeout(function() { link.remove(); }, 500);
        } else {
            window.location.href = uri;
        }
    }
Example curl:
curl http://pdf2doc.com/download/3sw4i3wpq25qm46s/testing/Test.doc
The response is, of course, the file itself. However, if sid, fid, etc. are invalid, the server will send back a 500 Server Error response.
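For completeness, a Python 3 sketch of the download step using only the standard library. Percent-encoding the filename and returning None on a 500 response are design choices made for this example, and opener is a hypothetical parameter that exists so the function can be tried without the live service.

```python
import urllib.error
import urllib.parse
import urllib.request

BASE_URL = "http://pdf2doc.com"


def download_url(sid, fid, convert_result):
    """Build the /download URL, percent-encoding the filename."""
    name = urllib.parse.quote(convert_result)
    return "%s/download/%s/%s/%s" % (BASE_URL, sid, fid, name)


def download(sid, fid, convert_result, opener=urllib.request.urlopen):
    """Return the converted file's bytes, or None if the server answers 500."""
    try:
        with opener(download_url(sid, fid, convert_result)) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 500:  # observed for an invalid sid, fid, etc.
            return None
        raise
```

A caller would typically write the returned bytes to a file named after convert_result, opened in binary mode.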
-
There are four main API endpoints:
/upload
/convert
/status
/download
-
Every endpoint uses a combination of a session ID (sid) and file ID (fid), which are both generated client-side.
-
There is no form of authentication.
-
There may be rate-limiting, but I don't expect this tool to be used so frequently by its users that rate-limiting actually becomes a problem.
I will be using Python 2.7 with the requests library to automate this process. In addition, I will utilize Tkinter and py2exe to package the software into a Windows executable with a GUI.