-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
creator as non-classic data object #161
base: master
Are you sure you want to change the base?
Conversation
One of the consequences of removing I suppose this is a good thing, since we don't have to store duplicate data about authors across libraries. But something to account for is how to deal with the deletion of obsolete creators. On every delete of an I think the second option is better, since the first option requires performing extra queries on every single DELETE Api call, which puts extra load on the dataserver |
Sorry, should've noted this earlier — we should just do what we do for (The empty |
`fieldMode` tinyint(1) unsigned DEFAULT NULL, | ||
PRIMARY KEY (`creatorID`), | ||
KEY `name` (`lastName`(7),`firstName`(6)) | ||
) ENGINE=InnoDB DEFAULT CHARSET=utf8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to be a PHP script like the others that does a proper migration of existing data. It should set the shard to down
, wait 20 seconds, do the migration, and bring the shard up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I only kept it as .sql until now for convenience.
I'll have it wrapped into the same algorithm as this update, except with state=down
.
misc/master.sql
Outdated
@@ -184,6 +184,7 @@ CREATE TABLE `libraries` ( | |||
`lastUpdated` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00', | |||
`version` int(10) unsigned NOT NULL DEFAULT '0', | |||
`shardID` smallint(5) unsigned NOT NULL, | |||
`hasData` TINYINT( 1 ) NOT NULL DEFAULT '0', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to add this, but it shouldn't be in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I push just this one line to master?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
Got it. So then, we completely drop the CREATE TABLE With a many-to-one relationship with the items table |
In my last commit, I got rid of |
php cript to update the database added copies of files before changes from this PR with old_ prefix zotero_autoload checks the shard and uses old files for shards that were not migrated yet
Added a php script to move data from To make it possible to migrate a few shards at a time, I added the php models without changes from this PR with an |
71a2a24
to
327f68e
Compare
Brief summary of changes up to this point:
I ran some benchmarking on a shard post-upgrade vs a shard before upgrade with profiler on check the performance of adding 5000 tags and then 5000 creators. Before: tags ~ 30 sec, creators ~ 30 sec. ~20,000 small DB queries that took 3-7 seconds. So the database queries are better (since I cut down the # of queries a lot) but there is some inefficiency elsewhere that slows things down. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial notes
Zotero_Admin_DB::query("ALTER TABLE `itemTags` DROP CONSTRAINT `itemTags_ibfk_2`;", false, $shardID); | ||
|
||
// Create new itemTags table | ||
Zotero_Admin_DB::query("CREATE TABLE `itemTagsNew` ( `tagID` BIGINT UNSIGNED NOT NULL, `itemID` BIGINT UNSIGNED NOT NULL, `name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL, `type` tinyint(1) unsigned NOT NULL DEFAULT '0', `version` int(10) unsigned NOT NULL DEFAULT '1', PRIMARY KEY (`tagID`, `itemID`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;", false, $shardID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Why did you leave
version
here? Because of?since=
support for tags? I think we can just take that out — I suspect it's not used, or is used rarely enough that we can give a brief warning that it's going away. It's only mentioned as an aside in the Syncing documentation, and I'm not seeing any recent usage of it. -
I'd say the same with
tagID
as withcreatorID
above, though I'm having second thoughts there. We could decide that the data savings are worth keeping acreators
table with acreatorID
andlibraryID
(but still no dates, key, or version). We don't do that for item data values and we won't do it for creators, but tags are much more likely to be assigned to many items, so the extra data usage could be significant. We would have to deal with purging when a tag was removed or an item was deleted, though, so it might not be worth it. Maybe need to run some queries there on real data to see how much data we would save. -
There's no index on the tag here. We'll have to check the queries used for tag-based API searches to see if it can even use an index or if it will only use the libraryID and scan for tags. But it'd be the same with the
name
index onitemCreators
— we either need it for both or for neither. (Before, in separate tables, we had to do name-based lookups to actually get the creatorID/tagID, so the indexes were definitely used.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yes, there is a test that calls
/tags?newer=
, so I assumed we need to keep it around. If we don't want it - that's great, I'll remove it from here and from my mocha tests PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yeah, some context from real world data would certainly be helpful. Maybe if we could see the output of
select count(*) from itemTags
and divide it byselect count(*) from tags
, we'd get an idea of roughly manyitemTag
relationships there are for each tag. And then if we multiply that byselect SUM(length(name)) from tags
, we'd have a rough estimate of how much extra memory duplicating tag names would cause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Queries related to tags that were affected are:
-
In
/items?tag=
where we select the itemIDs that fit the tag filterSELECT itemID FROM items JOIN itemTags USING (itemID) WHERE LOWER(itemTags.name) IN (?)
-
In
tags/?tag=
where we select tagIDs that fit the query before fetching their actual data and constructing objectsSELECT SQL_CALC_FOUND_ROWS DISTINCT tagID FROM itemTags JOIN items USING (itemID) WHERE libraryID=? AND
namein (?) AND
typein (?)
;
Currently, these do a full table scan, which is not good. There was a UNIQUE KEY uniqueTags (libraryID, name, type)
in the old creators table that was used for such queries. We can add KEY nameType (name, type)
to current itemTags
and these queries will use that key. I checked manually with explain
sql statement and it seems to work.
With creators, the name
key indeed does not do anything now because the creators never fetched by themselves and are only loaded as a part of an item via SELECT * FROM itemCreators WHERE itemID=?
.
To sum up, I guess we should add KEY nameType (name, type)
to itemTags and remove name
from creators?
@@ -330,7 +330,7 @@ CREATE TABLE `syncDeleteLogIDs` ( | |||
|
|||
CREATE TABLE `syncDeleteLogKeys` ( | |||
`libraryID` int(10) unsigned NOT NULL, | |||
`objectType` enum('collection','creator','item','relation','search','setting','tag','tagName') NOT NULL, | |||
`objectType` enum('collection','item','relation','search','setting','tag','tagName') NOT NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tag
should be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need all the schema changes here
// Creators | ||
echo "Migrating creators\n"; | ||
// Drop foreign key constraint | ||
Zotero_Admin_DB::query("ALTER TABLE `itemCreators` DROP CONSTRAINT `itemCreators_ibfk_1`;", false, $shardID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No semicolons needed in SQL from PHP (throughout)
@@ -74,7 +89,7 @@ function zotero_autoload($className) { | |||
return; | |||
} | |||
} | |||
|
|||
$GLOBALS['updatedShards'] = [1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be empty in the committed code
model/Creators.inc.php
Outdated
public static function bulkDelete($libraryID, $itemID, $creatorOrdersArray) { | ||
$placeholders = implode(', ', array_fill(0, sizeOf($creatorOrdersArray), '?')); | ||
$sql = "DELETE FROM itemCreators WHERE itemID=? AND orderIndex IN ($placeholders)"; | ||
Zotero_DB::query($sql, array_merge([$itemID],$creatorOrdersArray), Zotero_Shards::getByLibraryID($libraryID)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space after comma
$names = array_map(function($tag) { | ||
return $tag->name; | ||
}, $tags); | ||
$existingTagsSql = "SELECT t.tagID, t.name, MAX(t.version) as `version` from itemTags t JOIN items i USING (itemID) WHERE libraryID = ? AND name IN ($placeholders) GROUP BY tagID, `name`;"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- House style: Capitalized acronyms in variables:
existingTagsSql
→existingTagsSQL
- House style: Capitalized table aliases in SQL:
itemTags t
→itemTags IT
,items i
→items I
- Remove extra space after comma
- Only need backticks for reserved words (e.g.,
groups
)
controllers/ItemsController.php
Outdated
// Use a real tag name, in case case differs | ||
if (!$title) { | ||
$title = "Items of Tag ‘" . $tag->name . "’"; | ||
if (in_array($GLOBALS['shardID'], $GLOBALS['updatedShards']) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to have this same check everywhere. You can just do it once in header.inc.php and define()
a constant that's available globally (like, e.g., Z_ENV_BASE_PATH
).
Zotero_Admin_DB::query("ALTER TABLE `itemCreators` DROP CONSTRAINT `itemCreators_ibfk_2`;", false, $shardID); | ||
|
||
// Create new itemCreators table | ||
Zotero_Admin_DB::query("CREATE TABLE `itemCreatorsNew` ( `creatorID` BIGINT UNSIGNED NOT NULL, `itemID` BIGINT UNSIGNED NOT NULL, `firstName` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL, `lastName` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL, `fieldMode` tinyint(1) UNSIGNED DEFAULT NULL, `creatorTypeID` smallint(5) UNSIGNED NOT NULL, `orderIndex` smallint(5) UNSIGNED NOT NULL, PRIMARY KEY (`creatorID`, `itemID`), KEY `creatorTypeID` (`creatorTypeID`), KEY `name` (`lastName`(7),`firstName`(6)) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;", false, $shardID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I led you astray originally, and didn't clarify later. I originally said there'd still be creatorID
in creators
, but since we're inlining the data in itemCreators
, there's no reason for there to be a creatorID
at all — it's not serving any purpose. And the primary key should be on itemID, orderIndex
, since that's what's unique. (It was actually wrong to begin with — it should've been on itemID, orderIndex
before like in the client, not itemID, creatorID, orderIndex
.)
load()
-ing of Zotero_Creators. An object is constructed with data right away.join
inloadCreators
to load all creators on GET requests, instead of sequential$item->load()
.$creator->save()
calls on create or update