Fix #703: Recover from empty structure files in PDB_CACHE_DIR by sbliven · Pull Request #774 · biojava/biojava

sbliven · 2018-06-08T13:31:56Z

For a short period the ChemComp files didn't resolve, resulting in empty files getting cached by BioJava. The caching problem has long been fixed, but this PR should let BioJava recover if the user happened to use an old version of BioJava (e.g. older CE-Symm versions) that cached such files.

Fixes #703

log slow downloads upon start

The main change hasn't been implemented, so we want tests to fail. However, the tests exposed some NPE and IO exceptions. These are now fixed, so the tests fail in the expected manner.

- Use ATP ligand, which is not covered by the ReducedChemCompProvider
- Use the ReducedChemCompProvider as a fallback consistently, preventing null chemComp
- Defensive parsing in SimpleMMcifConsumer
- Robust test, doesn't require internet!

Files less than 40 bytes are deleted to allow for gzip headers. This addresses biojava#703

This is necessary when changing the cache path.

sbliven · 2018-06-08T13:36:42Z

...va-structure/src/main/java/org/biojava/nbio/structure/io/mmcif/DownloadChemCompProvider.java

-				return chemComp;
+				// May be null if the file was corrupt. Fall back on ReducedChemCompProvider in that case
+				if(chemComp != null) {
+					return chemComp;


This is a small change in behavior. If we can't read the downloaded chemcomp file, we now return a stub chemcomp rather than NULL. Previously the stub was returned for corrupt GZIP files (which threw an error) but not for malformed cif contents (which just returned null).
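A minimal sketch of that behavior, assuming a simplified lookup rather than the real provider classes: `Component` and `FallbackLookup` are hypothetical names used only for illustration. Both failure modes (an exception from a corrupt stream, and a null from malformed contents) end up at the same fallback, and a warning is logged so the substitution is visible.

```java
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Illustrative stand-in for a chemical component definition; not BioJava's ChemComp. */
final class Component {
    final String id;
    final boolean stub; // true when the definition came from the fallback

    Component(String id, boolean stub) {
        this.id = id;
        this.stub = stub;
    }
}

/** Hypothetical lookup that never returns null: it falls back to a stub definition. */
final class FallbackLookup {
    private static final Logger LOG = Logger.getLogger(FallbackLookup.class.getName());

    private final Function<String, Component> primary;  // may return null or throw
    private final Function<String, Component> fallback; // expected to always return a value

    FallbackLookup(Function<String, Component> primary, Function<String, Component> fallback) {
        this.primary = primary;
        this.fallback = fallback;
    }

    Component get(String id) {
        try {
            Component c = primary.apply(id);
            if (c != null) {
                return c; // parsed successfully
            }
            // null: the file was readable but its contents were malformed
        } catch (RuntimeException e) {
            // e.g. a corrupt compressed stream; handled the same way as null
        }
        LOG.log(Level.WARNING, "Falling back to stub definition for {0}", id);
        return fallback.apply(id);
    }
}
```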

sbliven · 2018-06-08T13:37:58Z

...va-structure/src/main/java/org/biojava/nbio/structure/io/mmcif/DownloadChemCompProvider.java

 		// probably a network error happened. Try to use the ReducedChemCOmpProvider
-		ReducedChemCompProvider reduced = new ReducedChemCompProvider();
+		if( fallback == null) {
+			fallback = new ReducedChemCompProvider();


Creating a ReducedChemCompProvider is cheap (and idempotent), but let's cache it anyway.

sbliven · 2018-06-08T13:39:05Z

biojava-structure/src/main/java/org/biojava/nbio/structure/io/util/FileDownloadUtils.java

+	 *
+	 * @param dir directory to delete
+	 */
+	public static void deleteDirectory(Path dir) throws IOException {


Surprisingly, java.nio doesn't provide a method for this, and I couldn't find one in BioJava.
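For reference, one common way to write such a helper with the standard library is `Files.walkFileTree`. This is a sketch under the assumption that symbolic links should not be followed (the default); the class name `DirectoryDeleter` is hypothetical and the version actually added to FileDownloadUtils may differ.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

final class DirectoryDeleter {
    /** Recursively deletes dir and everything below it. */
    static void deleteDirectory(Path dir) throws IOException {
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file); // files are removed as they are visited
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path d, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc; // listing the directory failed
                }
                Files.delete(d); // the directory is empty by now
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```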

sbliven · 2018-06-08T13:40:16Z

biojava-structure/src/test/java/org/biojava/nbio/structure/align/util/AtomCacheTest.java

+			cache.setPath(tmpCache.toString());
+			cache.setCachePath(tmpCache.toString());
+			cache.setUseMmCif(true);
+			ChemCompGroupFactory.setChemCompProvider(new DownloadChemCompProvider(tmpCache.toString()));


It's annoying how much code is needed to change the cache dir.

sbliven · 2018-06-08T13:43:19Z

Tests pass locally, but there's some danger of system-dependent bugs here since I'm hacking the cache path for the two new tests.

BTW, this could be an example of how to use AtomCache in tests without incurring network access.

sbliven · 2018-06-08T13:46:45Z

This should be merged to master after merging

josemduarte

Thanks @sbliven !

I've submitted a few comments and questions

josemduarte · 2018-06-08T15:12:58Z

biojava-structure/src/main/java/org/biojava/nbio/structure/cath/CathInstallation.java


 	protected void downloadFileFromRemote(URL remoteURL, File localFile) throws IOException{
 //        System.out.println("downloading " + remoteURL + " to: " + localFile);
+		LOGGER.info("Downloaded file {} to local file {}", remoteURL, localFile);


Shouldn't this be "Downloading"? It isn't downloaded at this point yet

josemduarte · 2018-06-08T15:15:39Z

biojava-structure/src/main/java/org/biojava/nbio/structure/io/LocalPDBDirectory.java

+							if( ! success) {
+								return null;
+							}
+							assert(!f.exists());


I'd rather log a warning here. Assertions are disabled by default at runtime, so this check is going to be ignored in normal situations.

That's intended. If the delete was successful then the file will not exist. This is just a sanity check during development to make sure delete() wasn't asynchronous or something.

josemduarte · 2018-06-08T15:17:20Z

biojava-structure/src/main/java/org/biojava/nbio/structure/io/LocalPDBDirectory.java

+						if( f.length() < MIN_PDB_FILE_SIZE ) {
+							boolean success = f.delete();
+							if( ! success) {
+								return null;


Shouldn't this rather throw an exception (perhaps IOException)? I don't see how this null can be handled if the file can't be deleted.

Switched to Files.delete, which throws the exception.
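The difference is that `File.delete()` reports failure through its boolean return value, while `Files.delete(Path)` throws an `IOException`. A rough sketch of the size check with the throwing variant follows; `SmallFileCheck` and `isUsable` are hypothetical names, and the 40-byte threshold is taken from the commit message rather than from the final code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class SmallFileCheck {
    /** Assumed threshold: anything shorter can hold little more than a gzip header. */
    static final long MIN_FILE_SIZE = 40;

    /**
     * Returns true if the cached file looks usable. A file below the threshold
     * is deleted and false is returned so that the caller can download it again.
     * If the deletion fails, Files.delete throws an IOException instead of
     * silently returning false the way File.delete() would.
     */
    static boolean isUsable(Path f) throws IOException {
        if (!Files.exists(f)) {
            return false;
        }
        if (Files.size(f) < MIN_FILE_SIZE) {
            Files.delete(f);
            return false;
        }
        return true;
    }
}
```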

josemduarte · 2018-06-08T15:19:25Z

biojava-structure/src/main/java/org/biojava/nbio/structure/io/mmcif/ChemCompGroupFactory.java

 	}
+
+	/**
+	 * Force the cache to be reset.


This clears the memory cache, right? Could you add that to the javadoc? It's an important detail, since there is both a memory cache and a file cache.

Only memory. Documented.

josemduarte · 2018-06-08T15:35:16Z

...va-structure/src/main/java/org/biojava/nbio/structure/io/mmcif/DownloadChemCompProvider.java

+		// delete files that are too short to have contents
+		if( f.length() < LocalPDBDirectory.MIN_PDB_FILE_SIZE ) {
+			f.delete();
+			return false;


If the file fails to be deleted, this won't work as expected. I think the failure-to-delete case should throw an IOException.

I don't check the status because the delete isn't really necessary here. If this method returns false then we re-download the file and move it on top of the old one. I just added the delete defensively.
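To illustrate why the explicit delete is optional: moving a freshly written file over the stale one with `REPLACE_EXISTING` overwrites it regardless. The sketch below uses a hypothetical `CacheRefresher` class and takes the new contents as a byte array instead of performing a real download.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class CacheRefresher {
    /**
     * Writes freshContents to a temporary file next to the cached file and then
     * moves it on top of the old one, so a stale or empty file is replaced even
     * if it was never deleted.
     */
    static void refresh(Path cached, byte[] freshContents) throws IOException {
        Path dir = cached.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "download", ".tmp");
        try {
            Files.write(tmp, freshContents);
            Files.move(tmp, cached, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(tmp); // no-op after a successful move
        }
    }
}
```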

josemduarte · 2018-06-08T15:36:55Z

...va-structure/src/main/java/org/biojava/nbio/structure/io/mmcif/DownloadChemCompProvider.java

+		}

-		return reduced.getChemComp(recordName);
+		return fallback.getChemComp(recordName);


Is there a warning logged when the fallback is used? That'd be important, so that users are aware and can investigate if there's something wrong in their code or file system

josemduarte · 2018-06-08T15:42:08Z

biojava-structure/src/main/java/org/biojava/nbio/structure/io/mmcif/SimpleMMcifConsumer.java

+			residueNrInt = Integer.parseInt(residueNumberS);
+		} else {
+			String label_seq_id = atom.getLabel_seq_id();
+			residueNrInt = Integer.parseInt(label_seq_id);


label_seq_id is not the same as residue number. This can introduce many problems.

We should discuss this further.

I added it because the atp.cif.gz file I added (from pymol) has label_seq_id but not auth_seq_id. It seems reasonable to me to fall back on the sequential numbering if the authors don't specify a custom numbering.

Note that auth_seq_id is optional but label_seq_id is required. From the spec it sounds to me like label_seq_id is a good fallback for ResidueNumber:

_atom_site.auth_seq_id:

An alternative identifier for _atom_site.label_seq_id that
may be provided by an author in order to match the identification
used in the publication that describes the structure.

Ideally BioJava would use Group.getId() rather than getResidueNumber() everywhere. However, many things still use residue numbers, so setting a default seems prudent.

There's another potential issue here, which is that the seq_id is stored as a long while ResidueNumber contains an Integer. I don't think there are any files with more than 2 billion groups per entity, but if we hit one then parsing the ResidueNumber would throw a NumberFormatException here. I think that's fine.

I started #775 to continue this discussion.
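For readers following the discussion, the proposed fallback can be sketched in isolation. This is only an illustration of the idea debated here, not the parser code: `ResidueNumberParser` is a hypothetical class, and treating `?` and `.` as missing values is an assumption based on the usual CIF placeholders.

```java
final class ResidueNumberParser {
    /**
     * Prefers the author-provided numbering (auth_seq_id) and falls back on the
     * sequential label_seq_id when the author value is absent.
     * Integer.parseInt throws NumberFormatException for values that do not fit
     * in an int, for non-numeric text, and for null.
     */
    static int parse(String authSeqId, String labelSeqId) {
        String chosen = isMissing(authSeqId) ? labelSeqId : authSeqId;
        return Integer.parseInt(chosen);
    }

    /** Assumption: null, empty, "?" and "." all mean that no value was given. */
    private static boolean isMissing(String value) {
        return value == null || value.isEmpty() || value.equals("?") || value.equals(".");
    }
}
```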

josemduarte · 2018-06-08T15:48:33Z

biojava-structure/src/main/java/org/biojava/nbio/structure/io/mmcif/SimpleMMcifParser.java

 		line = buf.readLine();
+		while( line != null && (line.isEmpty() || line.startsWith(COMMENT_CHAR))) {
+			line = buf.readLine();
+		}


Nice to handle comments in first line!

Is an empty first line also valid CIF?

I think blank lines are permitted anywhere in CIF, but I could be wrong. I try to write permissive parsers. It seems like it shouldn't break the spec so we might as well be robust to it either way.
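The loop from the diff can be tried on its own. In this sketch `LeadingLineSkipper` is a hypothetical wrapper, and `COMMENT_CHAR` is assumed to be `#`, the CIF comment marker.

```java
import java.io.BufferedReader;
import java.io.IOException;

final class LeadingLineSkipper {
    /** Assumed value of the parser's COMMENT_CHAR constant. */
    static final String COMMENT_CHAR = "#";

    /**
     * Returns the first line that is neither empty nor a comment,
     * or null if the input ends before such a line is found.
     */
    static String firstContentLine(BufferedReader buf) throws IOException {
        String line = buf.readLine();
        while (line != null && (line.isEmpty() || line.startsWith(COMMENT_CHAR))) {
            line = buf.readLine();
        }
        return line;
    }
}
```

Note that a line containing only whitespace is not empty in the sense of `String.isEmpty()`, so it would be returned as content by this loop.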

josemduarte · 2018-06-08T15:51:02Z

biojava-structure/src/test/java/org/biojava/nbio/structure/align/util/AtomCacheTest.java

+			assertTrue(chem.getAtoms().size() > 0);
+			assertEquals("NON-POLYMER", chem.getType());
+		} finally {
+//			FileDownloadUtils.deleteDirectory(tmpCache);


Why commented? Is cleaning up not needed here?

If not needed please remove the try/finally

Fixed. Cleaning up /tmp is polite, but makes it harder to debug failing tests.

josemduarte · 2018-06-08T15:53:29Z

biojava-structure/src/test/java/org/biojava/nbio/structure/align/util/AtomCacheTest.java

+			assertTrue(chem.getAtoms().size() > 0);
+			assertEquals("NON-POLYMER", chem.getType());
+		} finally {
+//			FileDownloadUtils.deleteDirectory(tmpCache);


Why commented?

See comments on biojava#774

Use GlobalsHelper.pushState()/restoreState() before and after tests to ensure that state isn't carried between tests. This is applied to the AtomCacheTest to fix test regressions while simplifying the code.

Maven runs tests with a clean environment, so we can't restore PDB_DIR

sbliven · 2018-06-15T10:49:42Z

@josemduarte please check again and see if I addressed all your suggestions.

The GlobalsHelper class that I added is pretty clean, and it would be good to use it for more tests that hack paths and factory methods.
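For readers unfamiliar with the pattern, a push/restore helper might look roughly like the following. This is not the GlobalsHelper from this PR: `GlobalStateStack` is a hypothetical class that only snapshots two system properties, whereas the real helper presumably also covers singletons such as the ChemCompGroupFactory provider.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class GlobalStateStack {
    /** Properties to snapshot; chosen for illustration only. */
    private static final String[] KEYS = {"PDB_DIR", "PDB_CACHE_DIR"};

    private static final Deque<Map<String, String>> STACK = new ArrayDeque<>();

    /** Saves the current values so that a test can change them freely. */
    static void pushState() {
        Map<String, String> snapshot = new HashMap<>();
        for (String key : KEYS) {
            snapshot.put(key, System.getProperty(key)); // null when the property is unset
        }
        STACK.push(snapshot);
    }

    /** Restores the most recently saved values. */
    static void restoreState() {
        Map<String, String> snapshot = STACK.pop();
        for (Map.Entry<String, String> entry : snapshot.entrySet()) {
            if (entry.getValue() == null) {
                // System.setProperty rejects null values, so unset properties are cleared
                System.clearProperty(entry.getKey());
            } else {
                System.setProperty(entry.getKey(), entry.getValue());
            }
        }
    }
}
```

In a test this would typically be called from setup and teardown methods, so that a changed cache path does not leak into later tests.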

sbliven · 2018-06-15T11:06:57Z

Tests pass with an old cache, but 4LNC got updated Tuesday and now TestExperimentalTechniques.test4LNC fails on Travis.

4LNC was updated to remove the X-Ray experimental method. This switches the test to 6F2Q, which uses both Neutron & Xray.

josemduarte · 2018-06-15T18:17:19Z

biojava-structure/src/main/java/org/biojava/nbio/structure/io/mmcif/SimpleMMcifConsumer.java

 		String recordName    = atom.getGroup_PDB();
 		String residueNumberS = atom.getAuth_seq_id();
-		Integer residueNrInt = Integer.parseInt(residueNumberS);
+		Integer residueNrInt;


Could you handle this in a separate pull request, as per the discussion in #775?

Parsing such a file again throws a NumberFormatException. Further work/discussion of this issue is on biojava#775, but it was blocking the merging of biojava#774.

sbliven · 2018-06-18T11:49:06Z

OK, I removed the mmcif changes (#775) and merged the result.

sbliven added 6 commits June 4, 2018 17:21

Improving CathInstallation logging

5c1a335

Add test for biojava#703

4785d57

Support initial comments in MMCif files (e.g. those generated by PyMOL)

c3cec91

Re-download empty structure or chemical component structures

6ea9ae7

Set and restore the ChemCompGroupFactory singleton

62fb2a4

sbliven mentioned this pull request Jun 8, 2018

CE-Symm fails with 'does not look like a valid mmCIF file!' errors rcsb/symmetry#91

Closed

josemduarte requested changes Jun 8, 2018

sbliven added 3 commits June 12, 2018 15:54

Incorporating suggestions from @josemduarte

f5b325f

Add GlobalsHelper class to manage our global state

13f267c

Fix NullPointer when restoring global state

c1c3c30

sbliven mentioned this pull request Jun 15, 2018

MMCif behavior when auth_seq_id is missing #775

Closed

Fix test failure due to PDB change

0c10a2e

josemduarte requested changes Jun 15, 2018

Revert fix for mmcif files without auth_seq_id column

08ab3e1

sbliven merged commit 5a4a68c into biojava:bugfixes-4.2 Jun 18, 2018
