How it works

Indexing

Bundle Scanner maintains an inverted index of NPM libraries, which lets us search for libraries by the contents of their files. To create the index, Bundle Scanner picks out NPM libraries that are likely to be bundled into frontend JavaScript bundles and downloads the most popular releases of those libraries. It walks through their files and picks out certain tokens from the code, mostly literals and object keys, that remain the same after minification. Production JavaScript bundles are usually minified, so only these tokens are used to match a library with its minified counterpart inside a bundle.
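
As a rough illustration of the idea, here is a minimal sketch of a token-based inverted index. It is not Bundle Scanner's actual code: the function names, the release ids (`lib-a@1.0.0`, `lib-b@2.1.0`), and the sample sources are all made up, and a naive regex over string literals stands in for a real JavaScript parser (it ignores object keys, escapes, template literals, and comments).

```javascript
// Hypothetical sketch; none of these names come from Bundle Scanner itself.

// Naively pull string literals out of source code. String literals survive
// minification, while identifiers and whitespace usually do not.
function extractTokens(source) {
  const tokens = [];
  const literal = /"([^"\\\n]*)"|'([^'\\\n]*)'/g;
  for (const match of source.matchAll(literal)) {
    const value = match[1] ?? match[2];
    if (value.length >= 3) tokens.push(value); // skip very short, noisy literals
  }
  return tokens;
}

// Map each token to the set of release ids whose source contains it.
function buildInvertedIndex(releases) {
  const index = new Map();
  for (const { id, source } of releases) {
    for (const token of extractTokens(source)) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(id);
    }
  }
  return index;
}

// The same function before and after minification yields the same tokens.
const original = 'function pad(str) { throw new Error("invalid pad length"); }';
const minified = 'function p(n){throw new Error("invalid pad length")}';

const index = buildInvertedIndex([
  { id: "lib-a@1.0.0", source: original },
  { id: "lib-b@2.1.0", source: "console.log('ready to log'); var x = 'invalid pad length';" },
]);
```

Looking up `index.get("invalid pad length")` returns both release ids, which also hints at why code shared between libraries makes matching harder.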

Scanning

When a user requests a scan of a website, Bundle Scanner scrapes all the JavaScript bundles that are used on that specific URL. For each bundle, it picks out tokens that can be used to search through its library index. It then scores NPM releases based mainly on how many tokens it can find in the same order both in the bundle and in the library code. Since bundlers change the order of code in ways that are hard to predict, it is unusual to find anything close to an exact match. Bundle Scanner then selects the best-matching version of every library.
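
One way to count "tokens found in the same order" is the length of the longest common subsequence of the two token lists. The sketch below assumes that approach; the real scoring is only described here as "mainly" order-based, so treat the function names, the normalisation by library length, and the sample token lists as illustrative assumptions.

```javascript
// Hypothetical scoring sketch, not Bundle Scanner's actual algorithm.

// Length of the longest common subsequence of two token arrays.
function longestCommonSubsequenceLength(a, b) {
  // previous[j] holds the LCS length of a[0..i-2] and b[0..j-1]
  let previous = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const current = [0];
    for (let j = 1; j <= b.length; j++) {
      current[j] =
        a[i - 1] === b[j - 1]
          ? previous[j - 1] + 1
          : Math.max(previous[j], current[j - 1]);
    }
    previous = current;
  }
  return previous[b.length];
}

// Fraction of the library's tokens that appear, in order, in the bundle.
function matchScore(libraryTokens, bundleTokens) {
  if (libraryTokens.length === 0) return 0;
  return longestCommonSubsequenceLength(libraryTokens, bundleTokens) / libraryTokens.length;
}

// Pick the release of one library that scores highest against the bundle.
function bestRelease(releases, bundleTokens) {
  let best = null;
  for (const release of releases) {
    const score = matchScore(release.tokens, bundleTokens);
    if (best === null || score > best.score) best = { id: release.id, score };
  }
  return best;
}

const bundleTokens = ["x", "a", "c", "b", "d", "y"];
const releases = [
  { id: "lib@1.0.0", tokens: ["a", "b", "c", "d"] }, // score 0.75
  { id: "lib@2.0.0", tokens: ["a", "c", "b", "d"] }, // score 1
];
const best = bestRelease(releases, bundleTokens); // { id: "lib@2.0.0", score: 1 }
```

The dynamic-programming table costs time proportional to the product of the two list lengths, so a real system would presumably narrow the candidates with the index before scoring them.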

In a second step, Bundle Scanner identifies the specific segment of the bundle code that each library has matched, and excludes libraries whose matched intervals are already taken by other libraries with a better match score. Finally, libraries with a match score above a certain threshold are shown to the user as being part of the bundle.
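
The exclusion step can be pictured as a greedy pass over candidates sorted by score. This is only a sketch under simplifying assumptions: each candidate is given a single half-open `[start, end)` interval in the bundle, and the field names, library ids, scores, and the `0.5` threshold are invented for the example.

```javascript
// Hypothetical sketch; candidates look like { id, score, start, end },
// where [start, end) is the matched range in the bundle.
function selectLibraries(candidates, threshold) {
  const accepted = [];
  // Visit the best matches first so they claim their intervals first.
  const byScore = [...candidates].sort((a, b) => b.score - a.score);
  for (const candidate of byScore) {
    const overlaps = accepted.some(
      (taken) => candidate.start < taken.end && taken.start < candidate.end
    );
    if (!overlaps) accepted.push(candidate);
  }
  // Only matches above the threshold are reported.
  return accepted.filter((candidate) => candidate.score > threshold);
}

const selected = selectLibraries(
  [
    { id: "lib-a", score: 0.9, start: 0, end: 100 },
    { id: "lib-a-fork", score: 0.6, start: 10, end: 90 }, // overlaps lib-a, excluded
    { id: "lib-b", score: 0.8, start: 100, end: 150 }, // adjacent to lib-a, kept
    { id: "lib-c", score: 0.2, start: 200, end: 220 }, // below threshold, dropped
  ],
  0.5
);
// selected ids: ["lib-a", "lib-b"]
```

Because candidates are visited in descending score order, a weaker match can never push out a stronger one that covers the same code.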

In benchmark tests, approximately 15% of libraries that are actually inside the bundle are not identified, and around 5% of libraries identified are false positives. The goal is of course to get these percentages closer to zero. The biggest challenge is the large amount of duplicated code on NPM. Many library authors show a strong devotion to the principles of copy-paste-driven development, which makes it hard for Bundle Scanner to distinguish between libraries.
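
To make the two benchmark figures concrete: the first is the share of actual libraries that were missed, and the second is the share of reported libraries that were wrong. The sketch below shows how such numbers could be computed for one bundle. The function name and the library lists are made up for illustration and are not benchmark data.

```javascript
// Hypothetical evaluation helper; the inputs are arrays of library names.
function evaluate(actual, identified) {
  const actualSet = new Set(actual);
  const identifiedSet = new Set(identified);
  const truePositives = [...identifiedSet].filter((id) => actualSet.has(id)).length;
  return {
    // Share of libraries really in the bundle that were not identified.
    missRate: actualSet.size === 0 ? 0 : (actualSet.size - truePositives) / actualSet.size,
    // Share of identified libraries that are not really in the bundle.
    falsePositiveShare:
      identifiedSet.size === 0 ? 0 : (identifiedSet.size - truePositives) / identifiedSet.size,
  };
}

// Made-up example: one library missed ("d"), one wrongly reported ("e").
const result = evaluate(["a", "b", "c", "d"], ["a", "b", "c", "e"]);
// result: { missRate: 0.25, falsePositiveShare: 0.25 }
```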

Contact

Bundle Scanner is developed by me - Markus Englund. I'm a developer based in Gothenburg doing mostly Node.js and frontend development. If you have any questions you can email me at markus@englund.dev. You can check out some of my open source work on GitHub.