v0.5.6 - Performance Part 2 (Is that a new scan loop?)
Published on September 2, 2022 by Joseph Milazzo
For the past 3 months, I have been working tirelessly to rebuild our main scan loop, which is not only the most complicated part of Kavita but also the most critical. It translates files on your system into the series, volumes, and chapters you know in the UI. Building the new loop was so difficult that I quit 3 times, but I finally made it through. The new loop is finally ready to share with you all, thanks to the many nightly users who helped test it and provided feedback.
What is this new scan loop I speak of? It is a new way (with a new set of requirements) to scan your disks and process files extremely efficiently. Kavita now uses a folder-based parsing mechanism rather than the previous one, where we would parse all files, group them, then perform DB operations in groups of 50. The reason for this drastic shift is that the old method failed to scale for users with massive libraries (per our stats, one user has half a million files) and took an exceedingly long time for users on networked storage (rclone).
To dive in, we need to talk about some requirement changes. In order to use folder-based scanning, we have to remove the flexibility that let users put files anywhere they wanted and have Kavita group them. When I started, I chose that mechanism based on feedback from Plex and Jellyfin, but in reality most of our users have proper folder-based structures, so this new requirement shouldn't affect many users.
Kavita expects every series to be in its own folder, so a library cannot have loose files in its root. You can still do /library root/manga/Publisher/Series A/ and /library root/manga/Publisher/Series B/, just not /library root/my book.pdf. Kavita will now scan folder by folder and, for each folder, group the files in it and process them as a single series in an async task. This task does all DB operations and invokes Cover Gen and File Analysis (word counting) for just that series. This change is important because Kavita now processes series by series and updates your UI as it goes, so on a first scan you should see your series, fully usable, sooner than before. There is a drawback, however: due to increased I/O and trips to the DB, the first library scan is slower. For me, it previously took 30 mins to scan all files into the DB (without covers or word counts); on the new loop it takes 80 mins (but includes covers and word counts). The main difference is that on the old loop I had to wait at least 10 mins until the first chunk was inserted, while the new loop has no such wait. This also means the Chunk errors users sometimes received should no longer be an issue, and if there is an error, it will be reported against the specific series. You can find more information about the new scan loop on the wiki.
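To make the layout rule concrete, here is a minimal Python sketch of folder-based grouping. It is not Kavita's actual code: the function name group_by_series_folder is hypothetical, and treating the folder that directly contains a file as its series folder is a simplifying assumption.

```python
from collections import defaultdict
from pathlib import Path


def group_by_series_folder(library_root: Path) -> dict[Path, list[Path]]:
    """Group files by the folder that directly contains them.

    Assumption for illustration: the immediate parent folder is the series
    folder. Files sitting directly in the library root are skipped, which
    mirrors the rule that a library cannot contain loose files.
    """
    groups: dict[Path, list[Path]] = defaultdict(list)
    for path in sorted(library_root.rglob("*")):
        if path.is_file() and path.parent != library_root:
            groups[path.parent].append(path)
    return dict(groups)
```

Each returned group could then be handed to its own task, which is what lets results appear series by series rather than after one large batch.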
With that said, this also allows some deep optimizations. Kavita will no longer do any I/O (except an attribute lookup on folders) if a folder hasn't changed since the last scan. This single optimization means that a scan of my 660-series library can run in 5 seconds, as it skips work on anything that hasn't changed. Many nightly testers have seen drastic time reductions, especially our rclone users. This also allows for 2 enhancements: the ability to ignore certain files/folders (aka .kavitaignore) and the ability to hook into Kavita, tell it a folder has changed, and let it kick off a scan.
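The skip-if-unchanged idea can be sketched roughly as follows. This is an illustration under stated assumptions, not Kavita's implementation: folders_needing_scan is a hypothetical name, and using the folder's modification time as the "has it changed" signal is my assumption about which attribute is looked up.

```python
import os
from pathlib import Path


def folders_needing_scan(library_root: Path, last_scan_time: float) -> list[Path]:
    """Return subfolders modified after last_scan_time (seconds since epoch)."""
    changed = []
    for dirpath, _dirnames, _filenames in os.walk(library_root):
        folder = Path(dirpath)
        if folder == library_root:
            continue
        # Only the folder's own attributes are read; no file is opened.
        if folder.stat().st_mtime > last_scan_time:
            changed.append(folder)
    return changed
```

Note that a folder's modification time typically changes when entries are added, removed, or renamed directly inside it, not when an existing file is edited in place, so a real scanner may need additional checks.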
Next up is .kavitaignore. Inspired by .plexignore, and a feature request as well, it lets users tell Kavita to ignore files and folders at any level of a scan. You can put one at the library root for a global set of rules, like ignoring cover.png, or nest them in your folders. You can have one in each folder, and Kavita will intelligently combine the rules as it moves down the folder tree. This is another degree of freedom that lets you not only filter but also speed up scans. This feature is experimental and still being tested and tweaked, but it works in many cases. Give it a go and report back on any issues you face or clever tricks that we can add to our wiki.
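As a rough illustration of how rules can accumulate down the folder tree, here is a small Python sketch. It is an assumption-laden stand-in: Python's fnmatch is used in place of Kavita's glob syntax (which may differ), patterns are matched against entry names only, and read_patterns and visible_files are hypothetical helper names.

```python
import fnmatch
from pathlib import Path

IGNORE_FILE = ".kavitaignore"


def read_patterns(folder: Path) -> list[str]:
    """Read one pattern per non-empty line from the folder's ignore file."""
    ignore_file = folder / IGNORE_FILE
    if not ignore_file.is_file():
        return []
    lines = ignore_file.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def visible_files(folder: Path, inherited: tuple[str, ...] = ()) -> list[Path]:
    """List files under folder that no inherited or local pattern matches."""
    # Rules from parent folders are combined with this folder's own rules.
    patterns = inherited + tuple(read_patterns(folder))
    kept = []
    for entry in sorted(folder.iterdir()):
        if entry.name == IGNORE_FILE:
            continue
        if any(fnmatch.fnmatch(entry.name, p) for p in patterns):
            continue  # an ignored folder is never descended into
        if entry.is_dir():
            kept.extend(visible_files(entry, patterns))
        else:
            kept.append(entry)
    return kept
```

Because an ignored folder is skipped before it is listed, ignore rules in this sketch reduce I/O as well as filtering results.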
Lastly, we have folder watching. This new, experimental feature is really something cool. Why wait for nightly scans when Kavita can pick up on changes and automatically run an appropriate scan? This new feature does just that. When a change is detected, a task is queued and executed after 5 minutes. Any further changes related to that first task are disregarded. Please note, this is experimental and is likely not to work if you download directly to your library. This feature will be fleshed out further in later releases.
TLDR: There are new requirements for what Kavita expects from file layout. Read about it on the wiki. If you scan without conforming to the new layout, you may run into issues.
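The queue-then-wait behaviour can be pictured with the following Python sketch. It is only an illustration: the ScanQueue class and its method names are hypothetical, and interpreting "related" changes as changes to the same folder is my assumption.

```python
from pathlib import Path

DELAY_SECONDS = 5 * 60  # the 5 minute wait described above


class ScanQueue:
    """Queues one delayed scan per folder and drops repeat notifications."""

    def __init__(self, delay: float = DELAY_SECONDS) -> None:
        self.delay = delay
        self._due: dict[Path, float] = {}

    def on_change(self, folder: Path, now: float) -> bool:
        """Queue a scan; return False if one is already pending for folder."""
        if folder in self._due:
            return False  # already covered by the pending task
        self._due[folder] = now + self.delay
        return True

    def pop_due(self, now: float) -> list[Path]:
        """Remove and return the folders whose delay has elapsed."""
        ready = [folder for folder, due in self._due.items() if due <= now]
        for folder in ready:
            del self._due[folder]
        return ready
```

Passing the current time in explicitly keeps the sketch easy to test; a real watcher would feed it timestamps from file system events.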
The release can be found here.
Added
- You can now have a .kavitaignore file at any location in your library, and the scanner will read it and filter out any matching folders/files beneath it. This works very similarly to .plexignore: it uses a globbing syntax and supports multiple patterns, one per line. You can find more about the syntax here. This feature is experimental.
- Added a new event type of Info which will show when scan loops are aborted due to nothing changing on disk. This aims to inform the user that the action they triggered had no effect.
- Added a clear all button on the events widget
- When a new series is added to Kavita, library detail pages will insert it instantly (previously we reloaded the page every 3 seconds after an update came in)
- Added --event-widget-info-bg-color for the new info event widget color
- When navigating away from a page that has a jump bar after the user has clicked the jump bar, resume their scroll position when they return to that page. This does not track ordinary scrolling; that will be looked at later.
- Added / as a server route to swagger, which could help users behind a reverse proxy from using their local server.
- Kavita now comes out of the box with Folder Watching for libraries. This means that when any sort of event occurs, like new files, file/folder renames, deletions, etc, Kavita will queue a task to scan those changes into the system. By default, tasks will only be processed every 5 mins. This is an experimental feature and disabled by default.
- Kavita now comes with a new API endpoint, library/scan-folder, which takes an API key (with admin permissions) and a folder path.
- Added API endpoint for a health check to streamline docker healthy status.
Changed
- Kavita now requires each series to be within its own folder and not spread between multiple folders. For example, a library root of /manga/ must have each series self-contained, as in /manga/series A/files and /manga/series B/files. Kavita will now ignore files that sit directly in /manga/.
- Rewrote the scan loop to perform folder-based scanning. Kavita will now process one folder at a time in a multi-threaded manner. This is a major change from how Kavita previously scanned, where all files had to be parsed before any DB work could be done. Users on networked systems should see a major improvement.
- Scanning tasks now run on their own dedicated queue and will not allow any concurrent runs. A scan task may take a bit longer to start, as the default queue takes priority.
- Added optimizations where Kavita will skip scanning a folder (all I/O) if nothing has changed on disk since the last scan of that folder.
- Fail silently if an email invite doesn't get sent, as the link is still present in the DB and UI screen.
- Ignore progress events altogether on Series cards so progress bars don't get skewed when a user is reading
- Reduced the spammy output from the logger, so clearer information is delivered instead of internal DB queries.
- Deleting a series will now automatically remove it from the Want to Read page
- Tweaked a migration to output a log only if it's going to be run
- Added dedicated token output to logs for auth flows (when inviting a user) in case the token is getting URL encoded (though from testing, it has no effect)
- Reverted back to ES6 vs ES2020 so older Safari browsers can use Kavita
- When parsing non-epubs in Book library, use Manga parsing instead of Comic, to better support Light Novels
- OPDS description no longer mentions that users will be able to download, as they currently still need the download role.
- Updated all links that point to external sites to use noopener noreferrer
- Updated the Email Service wording to point to the Github for the external email service
- Clear cache will now clear the temp directory as well, to clear out bookmarks and downloads.
- When a library is in progress and a user tries to delete the library, tell them to wait before we let them
- Rearranged some code in invite user flow to help with the small fraction of users that can't invite users.
- Added a null check so we don't log an exception when connecting a websocket when user is not authenticated
- When a theme that is set for a user gets removed from Kavita, inform the user to refresh.
- Manga reader Darkness control is now called Brightness
Fixed
- Fixed a bug where series with a LocalizedSeries comicinfo tag could fail to be joined, or be joined in the wrong order, on each scan, depending on which folders or files were processed first.
- Fixed a bug where collection images weren't loading
- Fixed Go back not returning to the scroll position after you clicked a nested link in the book reader.
- Fixed a bug in the book reader where, when moving between pages and a page had a smaller height, the total height calculation would get skewed and allow you to scroll beneath the bottom action bar.
- Fixed tooltips not rendering on series cards
- Fixed a missing update in Update Series, where we could change name of a series, but not recalculate the normalized name.
- Fixed a bug where archives with ._ files would be counted as valid files, when they are actually just metadata files from macOS
- Removed some extra blank space on collection detail page when collection doesn't have a summary or cover image
- Fixed a bug where adding a new item to a reading list wasn't adding it at the end of the list
- Fixed an issue where collection page would re-render the same covers for multiple cards
- Fixed a missing margin-top which made the page extras drawer not render correctly and hence be uncloseable on small screens
- Fixed a bug with stacked modals not showing overlay
- API changes to support some changes in Tachiyomi around progress syncing
- Fixed an issue where Save Bookmarks as WebP could get reset to false when saving other server settings
- Fixed a bug where typeahead wouldn't automatically show results on relationship screen without an additional click
- Fixed an issue where Chrome was caching API responses when it shouldn't have, causing some APIs to return invalid data after a modification
- When there is no data on collection detail page, make sure we show the no data design
- When resetting password, ensure the input is colored correctly
- Fixed Kavita throwing an error after a password reset even though the reset worked.
- Fixed incorrect messaging on reset password page
- Fixed a bug where reset password would show the side nav button causing skewing on the page
- Fixed a bug where a series with a relationship would prevent the series from being deleted (or the library it resided in)
- Downloading theme files was previously not allowed for non-authenticated users due to some changes in the Hotfix
- Hide the All Series side nav item if there are no libraries exposed to the user
Removed
- Removed a migration of bookmarks from the DB into external files. This was added for v0.5.0, 6 releases ago.
KavitaEmail
- Fixed a typo where Email service was using SenderAddress instead of Username for authentication
- Made the notice in the email more prominent that tells users to ignore it if they received it but didn't request it themselves